The application of objective techniques to the study of delinquent
and anti-social personalities has long since lost its novelty. From
Fernald’s attempt to measure the will-power of reformatory inmates
in 1912 to the prolonged experimentation of the Character Education
Inquiry, research-workers have recorded a creditable series of
ingenious endeavors to discover objective test materials which will
differentiate problem-behavior children and youths from their more
normal fellows. Hildreth’s bibliography of mental tests and rating
scales lists nearly four hundred measures of personality and
character, a considerable part of which bear directly upon this
problem. Space will not permit here a review of the experiments of
Cady and Raubenheimer in California, Schwesinger in New Jersey, Ruch
and Cushing in Iowa, Voelker, Lentz, Slawson, and Washburne in New
York, Haggerty and Olson in Minnesota, and others who have done
outstanding pioneer work in this field. Too often, however, the history
of these and similar studies has been a discouraging record of the
careful devising of tests found to differentiate a particular group of
delinquents from the controls against which they were matched, only
to discover on further application to other groups or in slightly
different settings that the supposed discriminative power of these
instruments has dwindled to negligible proportions.

The present experiment represents an endeavor to select from a
fairly catholic assortment of group tests those devices and test-items
which will distinguish delinquents or behavior-problem boys when
applied to various experimental groups, both in reformatory institu- 
tions and in ordinary public schools. While two of the ten types of 
test employed can lay claim to originality, neither of these was among 
the four selected to compose the final battery. The significance of 
the study lies, therefore, in the unusually extended validation of the 
sub-tests and their component items, rather than in the introduction 
of novel techniques. The result is the assembling of a test battery 
of considerable reliability and discriminative power which is yet 
entirely practicable for use under normal public school conditions. 


EXPERIMENTAL GROUPS AND CRITERIA 


The tests selected for investigation were submitted to experimental 
verification upon three separate and distinct problem-behavior groups 
and several sub-groups. The major experimental group consisted of 
three hundred four boys selected by the principals and counselors of 
twelve junior high schools in San Francisco and Berkeley as being 
their most serious disciplinary problems. These were matched against 
three hundred eight unselected boys from the same schools and of the 
same average IQ, namely 95.1. So marked was the over-age tendency 
in the problem group that it was found impossible to equate strictly 
for chronological age, but as an offset to this, the problem group was 
allowed the advantage of six months’ superiority in mental age as 
well as CA. 


TABLE I.—MEANS OF PROBLEM AND CONTROL GROUPS IN AGE AND INTELLIGENCE

                                                 N      CA      MA     IQ

San Francisco-Berkeley Junior High Schools
   Problem group ...........................   304   179.8   170.5   95.1
   Control group ...........................   308   173.0   164.9   95.1
Whittier State (Reform) School
   Whittier boys ...........................    98   177.4   161.8   91.7
   Control group (second sampling) .........    98   177.0   160.7   91.3

In the light of the findings from the San Francisco-Berkeley try-out, 
four of the original tests were eliminated from further consideration, 
and a tentative scoring scheme was adopted for the remaining six. 
This foreshortened battery was next administered to the entire male 
enrollment of a small junior high school in Oakland, California, and 
the test scores obtained were correlated with teacher ratings of these
pupils on the Haggerty-Olson-Wickman Behavior Rating Schedule.
Correlations were also determined between these test scores and
teacher ratings and the pupils’ performance on the Terman Group
Test of Mental Ability, as shown in Table II. It is interesting to
note that the behavior battery revealed a zero or negative correlation
with pupils’ showing on intelligence tests, both in the Oakland and
Whittier groups and regardless of whether group or individual tests
were used. On the other hand, teachers’ ratings of their behavior
showed a substantial relationship with intelligence.


TABLE II.—CORRELATIONS OF SCORES ON THE BEHAVIOR TEST BATTERY WITH
TEACHERS’ RATINGS OF BEHAVIOR AND INTELLIGENCE TEST SCORES

                                                                       Correlation with      Correlation with
                                                                       teachers’ ratings     score on battery
Measure                          Group tested                   N      of pupils’ behavior   of behavior tests

Battery of six behavior tests    Oakland Junior High School     104    .36 ± .057            .87* ± .009
Teachers’ ratings on Haggerty-
  Olson-Wickman Schedule         Oakland Junior High School     104    .92* ± .001           .36 ± .057
MA on Terman group test          Oakland Junior High School     104    ...                   −0.008 ± .072
IQ on Terman group test          Oakland Junior High School     104    .48 ± .054            −0.006 ± .073
IQ on Stanford-Binet             Whittier and control           [illegible]   ...            −0.14 ± .050

* Reliability coefficients.


Boys in the Oakland school who rated one or more PE above the 
mean in undesirable behavior as recorded on the Haggerty-Olson-
Wickman Schedule were then taken as a “problem” group, with those
rated one or more PE below the mean serving as controls, and the 
sigma differences in the test scores of the two groups were computed. 

The third criterion of test performance was obtained by giving the 
six tests to a group of ninety-eight boys in the Whittier State School 
for delinquents and comparing their papers with those of a group of 
equal age and intelligence drawn from the control populations in the 
Bay area. The showing of the various sub-tests on the three com- 
parisons may be seen in Table IV, following. 

As a result of the Oakland and Whittier experiments, a fifth test, 
Cheating, was dropped, whereupon an item analysis was made of the 
remaining five. With the exception of words in the False Vocabulary 
List, no item was retained in any test which did not show a difference 
of two or more PE between the problem and control groups in both 
the San Francisco and Whittier comparisons, or else a difference of
like sign in both instances, together with a difference of at least two 
PE for the two groups combined. Needless to say, this proved an 
exceedingly rigorous standard, and resulted in materially shortening 
the tests. 
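
The retention rule just described can be sketched in code. The following is only an illustration, assuming each item's problem-minus-control difference has already been expressed in PE units for the San Francisco comparison, for the Whittier comparison, and for the two groups combined. The function name `keep_item`, its parameters, and the requirement that two large differences also agree in sign are assumptions made for the sketch; the exemption granted to the False Vocabulary words is not modeled.

```python
def keep_item(diff_sf, diff_whittier, diff_combined, threshold=2.0):
    """Decide whether a test item survives the item analysis.

    All three arguments are signed problem-minus-control differences
    expressed in PE units (hypothetical inputs for this sketch).
    """
    # Both comparisons must point in the same direction (assumed for rule 1,
    # stated explicitly in the text for rule 2).
    same_sign = diff_sf * diff_whittier > 0

    # Rule 1: a difference of two or more PE in both comparisons.
    strong_in_both = (
        same_sign
        and abs(diff_sf) >= threshold
        and abs(diff_whittier) >= threshold
    )

    # Rule 2: like sign in both comparisons, and at least two PE
    # for the two groups combined.
    strong_combined = same_sign and abs(diff_combined) >= threshold

    return strong_in_both or strong_combined


if __name__ == "__main__":
    # Hypothetical items: (San Francisco, Whittier, combined) differences in PE.
    examples = {
        "clearly discriminating": (2.5, 3.0, 3.5),
        "weak but consistent": (0.8, 1.2, 2.1),
        "contradictory": (2.5, -2.2, 0.3),
        "weak everywhere": (0.5, 0.7, 1.1),
    }
    for label, diffs in examples.items():
        print(label, keep_item(*diffs))
```

Under these assumptions the first two hypothetical items are retained and the last two are dropped, which is the sense in which the standard is described as exceedingly rigorous.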


DESCRIPTION OF THE TESTS


The tests employed in this study included three which had proved 
effective in similar researches, and seven others designed to measure 
factors which the literature of the field suggested as promising leads. 
Listed in the order in which they were eliminated from the experiment, 
or in approximate ascending order of merit in their final revised form, 
these were as follows: 


1. Offense Ratings.—Subjects were asked to indicate the relative gravity of
ten graded offenses. This test, found moderately discriminative by
Raubenheimer and Ruch, yielded identical average scores for the problem and
control boys of our main experimental group, and was therefore discarded.

2. Distractibility.—Scores on a multiplication test made with and without the
distraction of hearing the experimenter reading lively selections aloud, showed only 
one sigma difference between problem and control boys, and the test was rejected. 

3. Suggestibility.—A vivid narrative was read aloud by Dr. Loofbourow, after
which the boys were given a true-false test of twenty-five items, largely leading 
questions implying circumstances not included in the text. A difference of 1.5 
sigma indicated the problem boys to be somewhat more suggestible than the con- 
trols. However, as this difference was the opposite of that found by Ruch and 
Cushing for their Aussage test with delinquent girls, and as differences in the
oral reading of different examiners might seriously affect the results, this test also 
was eliminated. 

4. A test of choices, or willingness to forego present pleasure for future advantage, 
consisted of eight such items as, 


Which would you rather have, ten cents right now or twenty-five cents a 
week from today? 


A difference of 1.4 sigma went to confirm the findings of Dr. John Washburne
that disciplinary problem children are notably lacking in inhibition and thought 
for the future. The items in the present test, however, were not particularly well 
constructed, and the occurrence of zero scores for more than half of the subjects 
led to its abandonment. 

5. Cheating.—The test employed was patterned after those of Carroll, Cady, 
and others. The subjects were called upon to perform tasks which would be 
impossible if they obeyed instructions to keep their eyes closed. After yielding 
the highly satisfactory difference of 5.1 sigma when applied to the San Francisco 
groups, this device gave zero or negative correlations on both the Oakland and 
Whittier comparisons. We were therefore reluctantly obliged to reject it. 

6. Morbid Imagination.—This test was perhaps the only entirely original
member of the battery. It consisted of thirty-six items, in each of which the
subject was asked to choose what he considered the most likely explanation of
the incident described; e.g.,

A man is standing on the sidewalk. He is very pale and nervous.

He is sick.
He is a dope fiend.
He has just heard some very bad news.
He is a thief, afraid he is going to be arrested.


This test was the last to be discarded, as it showed the problem boys tending 
to select the more lurid alternatives to the extent of differences of from 1.5 to 3.8 
sigma between the problem and control groups on all comparisons. It was elim- 
inated from the final battery only on the ground that a gain of less than two points 
in the coefficient of validity seemed insufficient to justify the added time and 
expense required to give this section. 

7. False Vocabulary.—This is a form of overstatement test in which the subject 
is asked to mark the words of which he knows the meaning. In the list of one 
hundred ‘‘words” are thirty false words, e.g., ladome, sanilent, simmuck. The 
number of these checked constitutes the score. The list used, patterned after one 
devised by Professor Robert Carroll of Pennsylvania State College, showed sub- 
stantial discrimination in all groups, and was retained unchanged. 

8. Social Attitudes.—This device of Raubenheimer’s, which was found valid by 
Ruch and Cushing also, proved its worth once more in the present study. Typical 
items are: 


HOBOES


They are lazy and dirty. 

They don’t have to work. 

They have a pretty good time. 
They are kind of friendly fellows. 


POLICEMEN 


It is fun to fool them. 

They have it in for the kids. 
They are glad to help you. 
They are just big bluffs. 


Item analysis made it possible to shorten this test from twenty-four to sixteen 
items. 

9. Courtesy.—This test, adapted from Hartshorne and May, was designed as a
measure of “unnecessary lying.” The subject is tempted to profess a degree of
rectitude which is highly improbable; for example, 


Do you disobey any of the rules at your school? 
Hardly ever. Sometimes. Almost always. 


Do you return borrowed property promptly? 
Hardly ever. Sometimes. Almost always. 


Do you act greedily by taking more than your share of anything? 
Hardly ever. Sometimes. Almost always. 
The makers of the test presumed that the problem or maladjusted child would
be more disposed to claim excessive virtue than the normal boy or girl. The
present study found the opposite to hold true in all groups compared. To cite an
extreme instance, twenty-eight of the control as opposed to seventeen of the
problem boys declared that they “almost always” reported the number of a car they
saw speeding. Of forty-four original items, only five behaved in the expected 
manner, but by reversing the scoring, differences of from one to nine sigma were 
obtained. Item analysis led to the elimination of twenty-three items and a scoring 
of the remainder on the basis of the most highly differentiating response. In 
fourteen of the twenty-three items retained, the significant response was neither 
hardly ever nor almost always, but sometimes. For example, on each of the three 
items cited above from two to three times as many problem as control boys 
responded sometimes. 

10. Adjustment Questionnaire.—Profiting by the item counts of Slawson with 
delinquent boys in New York state, a selection of eighty-one questions was made
from the familiar Woodworth-Cady and Matthews inventories. For example, 

Do you ever have the same dream over and over? Yes. No. 
Do you ever wish that you were dead? Yes. No. 

To these were added eighteen questions on school adjustment drawn from the
Jackson-Symonds Adjustment Survey, such as


Are you given a chance to show what you know in class? Yes. No. 


So thorough had been the validation by previous experimenters that only nine 
items were eliminated in the course of the present study, leaving a net total of 
ninety questions. 


PERFORMANCE OF THE TESTS 


Following the item analysis described, a weighting was determined 
for each of the four tests in the final battery by comparing the per- 
formance of each with that of the Adjustment Questionnaire, as 
follows: 

$$\text{Weighting factor} = \frac{CR_{\text{sub-test}}}{CR_{\text{questionnaire}}} \times \frac{\sigma_{\text{questionnaire}}}{\sigma_{\text{sub-test}}}$$

The CR, or Critical Ratio, is in each case the difference between the
mean scores of the problem and control groups, expressed in terms of
standard errors of that difference. The weights as thus determined
may be seen in Table III.


TABLE III.—DERIVATION OF WEIGHTINGS ADOPTED FOR FINAL BATTERY

                                              Critical ratio,        Weighting of sub-tests
                          σ(P+C)              or Diff./σDiff.        based on group
                                                                                                    Weights
Test                   San                    San                    San
                       Francisco  Whittier    Francisco  Whittier    Francisco  Whittier   Both     adopted

False vocabulary         5.3        6.1         2.6        5.3         0.86       1.11      0.99       1
Social attitudes         3.6        2.7         5.1        2.9         2.49       1.38      1.94       2
“Courtesy”               2.9        4.7         4.2        4.7         2.55       1.92      2.24       2
Questionnaire            1.8        0.6         6.7        8.5         1.00       1.00      1.00       1
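
As a rough check on the weighting procedure, the sketch below computes a critical ratio and the weighting factor relative to the Adjustment Questionnaire. Several points are assumptions rather than details given in the study: the function names are invented for illustration, the standard error of the difference is taken in its usual independent-groups form, and the San Francisco questionnaire σ is set to 11.8 because the printed figure is not cleanly legible in this copy and 11.8 is the value that reproduces the printed San Francisco weights.

```python
from math import sqrt


def critical_ratio(mean_problem, mean_control, sd_problem, sd_control,
                   n_problem, n_control):
    """Difference between group means in units of its standard error.

    Assumes independent groups, so the standard error of the difference
    is sqrt(sd_p^2 / n_p + sd_c^2 / n_c).
    """
    se_diff = sqrt(sd_problem ** 2 / n_problem + sd_control ** 2 / n_control)
    return (mean_problem - mean_control) / se_diff


def weighting_factor(cr_subtest, sigma_subtest, cr_questionnaire,
                     sigma_questionnaire):
    """Weight of a sub-test relative to the questionnaire (weight 1.00)."""
    return (cr_subtest / cr_questionnaire) * (sigma_questionnaire / sigma_subtest)


# San Francisco figures: (critical ratio, sigma of combined groups).
CR_QUESTIONNAIRE = 6.7
SIGMA_QUESTIONNAIRE = 11.8  # assumption, see the note above
SAN_FRANCISCO = {
    "False vocabulary": (2.6, 5.3),
    "Social attitudes": (5.1, 3.6),
    "Courtesy": (4.2, 2.9),
}

weights = {
    name: weighting_factor(cr, sigma, CR_QUESTIONNAIRE, SIGMA_QUESTIONNAIRE)
    for name, (cr, sigma) in SAN_FRANCISCO.items()
}

if __name__ == "__main__":
    for name, weight in weights.items():
        print(f"{name}: {weight:.2f}")
```

The adopted weights of 1, 2, 2 and 1 are these factors averaged with their Whittier counterparts and rounded to whole numbers.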


The reliabilities of the six sub-tests and battery were found to be 
as follows: 








                                    N           r

[test name illegible]              637     .871 ± .009
[test name illegible]              637     .931 ± .005
[test name illegible]              339     .881 ± .009
[test name illegible]              405     .883 ± .015
[test name illegible]              447     .810 ± .011
[test name illegible]              637     .885 ± .009
[test name illegible]              637     .867 ± .009

These coefficients were obtained by the method of split-halves and the 
Spearman-Brown formula, and are based on the tests in their original 
form. The revised battery of four tests yielded a reliability coefficient 
of .95 ± .01. The latter, however, was based on the same groups used
for the selection of items, so an independent determination might be 
expected to approximate .90. 
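
The split-half procedure with the Spearman-Brown correction can be illustrated as follows. This is a sketch under stated assumptions: the study says only that the method of split-halves was used, so the odd-even division of items, the function names, and the tiny data set are all hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half, factor=2.0):
    """Step a part-test correlation up to a test `factor` times as long."""
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)


def split_half_reliability(item_scores):
    """Correlate odd-item and even-item totals, then apply Spearman-Brown.

    `item_scores` holds one list of item scores per subject.
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    return spearman_brown(pearson_r(odd_totals, even_totals))


if __name__ == "__main__":
    # Hypothetical right/wrong item scores for five subjects on six items.
    scores = [
        [1, 1, 1, 0, 1, 1],
        [1, 0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 0],
        [1, 1, 1, 1, 0, 1],
        [0, 1, 0, 0, 0, 0],
    ]
    print(round(split_half_reliability(scores), 3))
    print(round(spearman_brown(0.5), 3))
```

A half-test correlation of .50, for instance, steps up to about .67 for the full-length test.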

The major constants for the six sub-tests and for the battery as a 
whole when applied to the several experimental groups are shown in 
Table IV. In endeavoring to evaluate the magnitude of the differ- 
ences and the bi-serial r’s indicated, it should be borne in mind that 
boys in the control groups for both the San Francisco and the Whittier 
studies were in no sense selected for good behavior but merely a random 
sampling of those not specifically designated by the school authorities 
as disciplinary problems. Matching of these control boys with the 
problem cases on the basis of age and intelligence resulted in obtaining 
control groups which were themselves largely composed of near-prob- 
lem boys. This was pointed out repeatedly by the principals and 
counselors. Needless to say, this greatly reduced the validity coeffi- 
cients obtained. 

In order to throw some light upon how well the test might be
expected to discriminate between groups widely spaced in behavior
and not equated for intelligence, the San Francisco and Berkeley
principals were asked to choose from the control group those boys who
were the best behaved, and from the problem group those who were the
worst. This yielded a reduced group of sixty “best” and fifty-two
“worst” boys. Applied to this group, the original battery of six
tests without further refinement proved to discriminate to such an
extent that no one of the “best” group exceeded the thirty percentile
score of the “worst.” At the same time the difference between the
means of the problem and control scores leaped from twelve to
twenty-five times the sigma difference, while the bi-serial r rose
from .55 to .93 (see section 11 of Table IV).


TABLE IV.—PERFORMANCE OF THE SEVERAL TESTS PRIOR TO ITEM ANALYSIS

                                        Number           Mean score        Diff./
Test and experimental group         Problem  Control   Problem  Control    σDiff.    Bi-serial r

 1. Offense ratings.
      San Francisco                   304      308       11.5     11.5      0.02     .001 ± .05
 2. Distractibility.
      San Francisco                   304      308       21.3     21.0      1.0      .041 ± .05
 3. Suggestibility.
      San Francisco                   304      308        4.2      4.0      1.5      .074 ± .05
 4. Choices or foresight.
      San Francisco                   304      308        1.44     1.36     1.4      .071 ± .05
 5. Cheating.
      San Francisco                   304      308       16.1     12.9      5.1      .25 ± .04
      Oakland                          23       30        8.9      9.0     −0.2     −.014 ± .12
      Whittier                         98       99       13.8     14.6     −0.8     −.08 ± .08
 6. Morbid imagination.
      San Francisco                   304      308       14.3     13.3      1.5      .07 ± .05
      Oakland                          23       30       16.1     12.1      1.9      .29 ± .11
      Whittier                         98       99       13.7 (other mean illegible)  3.8   .29 ± .07
 7. False vocabulary.
      San Francisco                   304      308        3.9      2.8      2.6      .13 ± .04
      Oakland                          23       30       11.0      2.1      3.0      .50 ± .09
      Whittier                         98       99        5.9      6        5.3      .44 ± .07
 8. Social attitudes.
      San Francisco                   304      308        4.6      3.1      5.1      .25 ± .04
      Oakland                          23       30        4.5      2.6      2.6      .44 ± .10
      Whittier                         98       99        3.4      2.2      2.7      .24 ± .07
 9. “Courtesy” test.
      San Francisco                   304      308       21.7     21.2      0.9      .04 ± .04
      Oakland                          23       30       23.1     17.2      2.1      .34 ± .11
      Whittier                         98       99       25.2     16.5      8.8      .64 ± .06
10. Questionnaire.
      San Francisco                   304      308       26.3     20.2      6.7      .33 ± .04
      Oakland                          23       30       29.7     23.9      1.8      .30 ± .12
      Whittier                         98       99       26.2     15.6      7.5      .59 ± .06
11. Battery, tests 5 to 10 inclusive, with original weightings.
      San Francisco                   304      308      230.0    170.6     12.1      .55 ± .03
      Oakland                          23       30      222.8    173.3      2.7      .44 ± .10
      Whittier                         98       99      214.3    157.8      7.2      .57 ± .04
      Best vs. Worst of S. F. group    52       60      271.2    150.8     24.6      .93 ± .03
12. Battery, tests 6 to 10 inclusive, weighted as in reference 7, page 35.
      San Francisco                   304      308      142.5     92.1     14.6      .64 ± .03
      Whittier                         98       99      132.9     83.9      9.8      .73 ± .05
13. Battery, tests 7 to 10, after item analysis and reweighting.
      San Francisco                   100      100       68.2     31.9     16.8      .73 ± .05
      Whittier                         98       98       52.3     30.5     10.7      .77 ± .05


PREDICTION OVER A THREE YEAR INTERVAL 


As a check upon the validity of predictions afforded by the battery, 
a follow-up study was made in 1933 of the public school group employed 
in the original try-out three years before. It was found possible to 
locate and obtain fresh data on 130 of these boys. Sixty-two of these 
had been designated by the junior high school authorities in 1930 
as ‘‘problems” and 68 as non-problem cases. Of those who had 
scored in the highest (worst) quarter on the 1930 test, 28 per cent had 
court records and 34 per cent behavior clinic records by 1933. Of
those who had scored in the second highest quarter, 6.4 per cent had 
court records by 1933 and 16 per cent had been referred to the clinic. 
Of those who had made lower than median test scores in 1930 not one 
had become a court or behavior-clinic case. 

Behavior ratings were also obtained for these 130 boys, as made 
by their high school counselors in 1933 on the basis of their disciplinary 
records since 1930. As a result, fifteen who had been designated 
problems in 1930 were rated non-problems in 1933, and 16 accounted 
non-problems in 1930 had entered the “problem” category by 1933. 
Of these 31 changes in classification over the three year period, 23, or 
74 per cent, were in the direction indicated by the test scores made 
in 1930. Moreover, whereas the test scores and behavior ratings 
in 1930 had correlated but .51 ± .06 for this group (cf. section 11 of
Table IV), these same 1930 scores proved to predict the behavior
ratings of 1933 to the extent of .66 ± .05. As a forecast of later
behavior, the test scores thus appear more valid than the considered 
judgments of principals and counselors. 


THE REVISED BATTERY, OR PERSONAL INDEX TEST 


Tables V and VI, together with Fig. 1, indicate the discrimination 
obtained by an application of the final revised battery to another 
sampling of the San Francisco and Whittier groups. Inasmuch as the 
critical ratios and validity coefficients here shown were computed upon 
papers drawn from the same groups that determined the selection 
of items and weighting for the sub-tests, they may be expected to show 
a certain shrinkage when applied to wholly new and different groups. 

TABLE V.—DISCRIMINATION AND VALIDITY COEFFICIENTS OF THE FINAL BATTERY
San Francisco-Berkeley Sampling of One Hundred Problem and One Hundred
Control Boys. Whittier Group of Ninety-eight against a Fresh Sample of
Ninety-eight Control Boys

                           Critical ratio,
                           Diff./σDiff. (P − C)                 Bi-serial r

Test                   San                              San
                       Francisco  Whittier   Mean       Francisco    Whittier     Mean

False vocabulary         2.6        5.3       4.0       .13 ± .06    .44 ± .05    .29
Social attitudes         5.1        2.9       4.0       .25 ± .05    .28 ± .05    .27
“Courtesy” test          4.2        4.7       4.5       .34 ± .05    .40 ± .05    .37
Questionnaire            6.7        8.5       7.6       .35 ± .05    .65 ± .04    .50
Battery                 16.8       10.7      13.7       .73 ± .03    .77 ± .03    .75

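TABLE VI.—DISTRIBUTION OF NINETY-EIGHT WHITTIER VERSUS NINETY-EIGHT
CONTROL BOYS ON THE PERSONAL INDEX TEST

                  Frequency
Score        Whittier     Control

96-101           1
90-95            3
84-89            1
78-83            3
72-77            5
66-71            5           1
60-65           12        [illegible]
54-59           13        [illegible]
48-53           15        [illegible]
42-47           11        [illegible]
36-41       [illegible]   [illegible]
30-35       [illegible]   [illegible]
24-29       [illegible]   [illegible]
18-23       [illegible]   [illegible]
12-17       [illegible]   [illegible]
 6-11       [illegible]   [illegible]

                     Whittier     Control
N                       98           98
Mean                   52.3         30.5
Upper quartile         62.3         37.7
Median                 51.1         29.8
Lower quartile         39.7      [illegible]

Proposed critical score = 40.
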
Even after due allowance for such shrinkage, however, there is good
reason to believe that the Personal Index battery here described will
prove the most dependable and practicable group test for the detection
of behavior problems that has been assembled to date. 

In its final, revised form, the battery is only two-fifths as long as the 
original, and requires but forty-five minutes to give, even to slow 
groups. The test is practically self-administering, and the scoring 
simple. 
A score of forty, or the lower quartile for the Whittier reform school
group, serves conveniently as a critical score. In so far as the present 
findings prove typical, a dividing line drawn at this point will dis- 
tinguish three out of four delinquent boys, as against less than twenty 
per cent of others (see Table VI). In the present study, almost four 
out of five of the boys scoring forty or higher were acute disciplinary 
problems, and the fifth was often a borderline case. This indicates a 
degree of prediction comparable to that with which a good intelligence 
test forecasts achievement in school subjects. 
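
The use of a critical score can be sketched as below. The two lists of scores are invented purely for illustration, and the function names are hypothetical; only the cut-off of 40 and the convention that a high score is unfavorable come from the study.

```python
def share_at_or_above(scores, critical_score=40):
    """Fraction of a group scoring at or above the critical score."""
    return sum(1 for s in scores if s >= critical_score) / len(scores)


def share_of_flagged_who_are_problems(problem_scores, other_scores,
                                      critical_score=40):
    """Among all boys at or above the cut-off, the fraction who are problems."""
    flagged_problem = sum(1 for s in problem_scores if s >= critical_score)
    flagged_other = sum(1 for s in other_scores if s >= critical_score)
    flagged = flagged_problem + flagged_other
    if flagged == 0:
        raise ValueError("no scores at or above the critical score")
    return flagged_problem / flagged


# Hypothetical Personal Index scores (not data from the study).
problem_scores = [72, 55, 48, 41, 63, 38, 50, 29]
other_scores = [22, 35, 41, 18, 30, 27, 33, 12, 25, 44]

if __name__ == "__main__":
    print(share_at_or_above(problem_scores))   # share of problem boys flagged
    print(share_at_or_above(other_scores))     # share of other boys flagged
    print(share_of_flagged_who_are_problems(problem_scores, other_scores))
```

The first two figures correspond to the "three out of four" and "less than twenty per cent" kind of statement in the text; the third depends on the relative sizes of the two groups as well as on the cut-off.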

It should be remembered also that the present battery has been so 
selected as to yield little or no correlation with intelligence. But low 
intelligence is known to be an important factor in delinquency. It 
seems clear, then, that the use of this battery together with a good test 
of intelligence should serve to indicate with a high degree of accuracy 
those boys who are, or are likely to become, serious behavior problems. 
By identifying these individuals upon entrance to junior high school,
teachers and counselors may be enabled to take all possible steps to
prevent unfortunate occurrences, and the further development of
anti-social trends.


SUMMARY 


1. Application of ten group tests to reformatory inmates and 
various groups of junior high school boys designated as disciplinary 
problems contrasted with public school boys of like age and intelligence 
not so designated has resulted in the assembling of an abbreviated 
battery of four group tests which can be administered in a forty-five 
minute period. 

2. The final battery, known as the Personal Index, shows a reli- 
ability in excess of .90, and a validity approximating .75. 

3. This battery has been so constructed as to yield no correlation 
with the Terman Group Test of Intelligence. 

4. Personal Index scores of 40 or higher have been found to char- 
acterize three out of four inmates of a well-known state reform school, 
as contrasted with only one in five junior high school boys not selected 
as behavior problems. 

5. A follow-up study of 130 boys revealed that of the 26 having 
court or behavior clinic records by 1933, every one had scored worse 
than the median of the group on the battery administered three 
years before. 

6. Where test scores and behavior ratings given by school authorities
in 1930 disagreed, independent ratings in 1933 proved to confirm
the original test scores in 74 per cent of the cases.

THE SCORING OF INDIVIDUAL PERFORMANCE ON
TESTS SCALED ACCORDING TO THE THEORY OF 
ABSOLUTE SCALING 


CLARENCE W. BROWN, PHYLLIS BARTELME AND GERTRUDE M. COX 
University of California 


The general requirements to be satisfied by a procedure for scoring 
individual performance have been outlined by Thurstone. He has
devised a method of scoring which satisfactorily meets these require- 
ments for both mental age scales and educational test scales. This 
method also has been used for scoring individual performance on tests, 
such as the Gesell Developmental Schedule, after the items have been 
scaled on the basis of the absolute scaling technique.* The present 
paper is concerned with a scoring device to be used only on tests which 
have been so scaled. 

The advantages of absolute scaling have been adequately pointed 
out elsewhere. Two may be merely mentioned here. By this
technique it is possible to obtain a stable unit of measurement; a unit 
which is consistent throughout the complete range of the ability being 
measured. It is further possible to refer all points on the continuum 
of ability to a common origin. If this origin is placed at the mean of 
one of the overlapping groups it is called a relative zero. In mental 
age scales it is possible to set this origin at the absolute zero of test 
intelligence. 

In the method of absolute scaling each item is given a value by 
which it can be located on the continuum representing the range of 
ability measured by the test. The easier items are placed at the lower 
end of the scale while items of progressively more difficulty are placed 
progressively higher on the scale. The question arises as to whether 
the absolute values of the items should be taken into consideration in 
scoring individual performance. Should each question passed con- 
tribute equally to the score or should each question contribute in pro- 
portion to its scaled value? If the items failed are also considered 
in the scoring procedure then the question would be: Should the pen- 
alty be the same for each item failed or should it be proportional to the 
scaled value of the item? If the absolute values of the test items are 
considered in the scoring the contribution from the items passed 





* Items of the Gesell Developmental Schedule have been scaled by Minnie L. 
Steckel. See Jour. Educ. Psychol., February, 1932, pp. 99-103. 
becomes greater as the items fall higher on the scale; that is, the
examinee receives more credit for answering a difficult item than for 
answering an easy one. For items failed the weighting is just the 
reverse; the penalty for failing an easy item is greater than the penalty 
for failing a more difficult one. The problem of obtaining an indi- 
vidual’s score then becomes one of determining that point on the 
continuum at which the positive contribution from the correct items
equals or balances the negative contribution from the incorrect items. 

One of the common characteristics of individual performance is the 
presence of inaccurate responses over a rather wide range of the scale. 
The individual does not usually correctly answer all of the items up to 
a certain point on the continuum and then fail all of the items beyond 
this point. There is a range of inaccuracy in which the percentage of 
error tends to increase as the scale is ascended until a point is reached 
beyond which all items are failed. The value which is usually accepted 
as being the most representative of the individual’s performance or 
ability is the scale value at which he works at fifty per cent efficiency 
or accuracy. In Thurstone’s procedure the individual’s score is taken
as “that scale value above which there are as many correct answers
as there are wrong ones below it.” In the scoring procedure outlined 
below, and called the item value procedure, the absolute values of the 
test items are considered. The score of the individual is taken as that 
point on the scale at which the average deviation of the right items above it 
equals the average deviation of the wrong items below it. It is then the 
scale value at which the positive contribution from the items passed 
above it equals the negative contribution from the items failed below it. 
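
The count-based rule quoted from Thurstone can be sketched as follows, for comparison with the item value procedure. This is one reading of the quoted sentence rather than Thurstone's published routine: the function name, the representation of a record as (scale value, passed) pairs, and the choice of the point halfway between the two neighboring items as the reported score are assumptions.

```python
def thurstone_score(items):
    """Scale point with as many passes above it as failures below it.

    `items` is a list of (scale_value, passed) pairs.
    """
    if not items:
        raise ValueError("at least one item is required")
    ordered = sorted(items, key=lambda item: item[0])

    wrong_below = 0
    correct_above = sum(1 for _, passed in ordered if passed)
    cut = 0  # the cut lies just below ordered[cut]

    # Moving the cut up past one item raises (wrong_below - correct_above)
    # by exactly one, so the two counts are certain to meet.
    while wrong_below != correct_above:
        _, passed = ordered[cut]
        if passed:
            correct_above -= 1
        else:
            wrong_below += 1
        cut += 1

    if cut == 0:                 # nothing passed: score at the lowest item
        return ordered[0][0]
    if cut == len(ordered):      # nothing failed: score at the highest item
        return ordered[-1][0]
    # Assumed convention: report the point midway between the neighbors.
    return (ordered[cut - 1][0] + ordered[cut][0]) / 2


if __name__ == "__main__":
    # Hypothetical record: scale value and whether the item was passed.
    record = [(10, True), (12, True), (14, False), (16, True),
              (18, False), (20, True), (22, False), (24, False)]
    print(thurstone_score(record))
```

Only the number of passes and failures on either side of the cut enters this rule; the scale values of those items do not, which is the point at which the item value procedure departs from it.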

It is obvious that the score will be somewhere in the range of 
inaccuracy. The scale value of the midpoint of this range is selected 
as the most probable value of the score. This value is then corrected 
according to the relative contributions of the right answers above it 
and the wrong answers below it. 

The complete formula for obtaining the score is 

$$S = \frac{X_a + X_b}{2} + \frac{1}{2}\left[\left(\frac{\Sigma x}{n_x} - \frac{X_a + X_b}{2}\right) - \left(\frac{X_a + X_b}{2} - \frac{\Sigma y}{n_y}\right)\right]$$

in which 
$X_a$ = scale value of that test item below which all items are passed, 
$X_b$ = scale value of that test item above which all items are failed, 
$x$ = scale values of items passed above the midpoint of the range of inaccuracy, 
$y$ = scale values of items failed below the midpoint of the range of inaccuracy, 
$n_x$ = number of items passed above the midpoint of the range of inaccuracy, 
$n_y$ = number of items failed below the midpoint of the range of inaccuracy, 
$S$ = score of the individual. 
The midpoint of the range of inaccuracy is obtained from the first 
term, $\frac{X_a + X_b}{2}$. The average deviation of the items passed above this 
point is given by the term $\frac{\Sigma x}{n_x} - \frac{X_a + X_b}{2}$. The average deviation 
of the items failed below this point is given by the term 
$\frac{X_a + X_b}{2} - \frac{\Sigma y}{n_y}$. The midpoint is then corrected by half of the 
difference between the two average deviations. When the midpoint 
is lower than the score value the correction is positive and when the 
midpoint is higher than the score value the correction is negative. 

By cancelling the values common to the two terms the foregoing 
formula reduces to 

$$S = \frac{\Sigma x}{2n_x} + \frac{\Sigma y}{2n_y}$$

Although in this form the value of the midpoint does not appear it is 
still needed in order to determine the values of $\Sigma x$ and $\Sigma y$. 
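The cancellation can be written out step by step. The lines below are a reconstruction of the algebra rather than a quotation; M is used as shorthand for the midpoint $\frac{X_a + X_b}{2}$, following the label M in Table I.

```latex
\begin{align*}
S &= M + \tfrac{1}{2}\left[\left(\frac{\Sigma x}{n_x} - M\right) - \left(M - \frac{\Sigma y}{n_y}\right)\right] \\
  &= M + \tfrac{1}{2}\left(\frac{\Sigma x}{n_x} + \frac{\Sigma y}{n_y}\right) - M \\
  &= \frac{\Sigma x}{2n_x} + \frac{\Sigma y}{2n_y}.
\end{align*}
```

The midpoint drops out of the expression itself, but it still decides which passed items enter $\Sigma x$ and which failed items enter $\Sigma y$.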

The obtaining of an individual’s score by this formula can be 
reduced to a few simple steps, which are illustrated in the following 
example. Table I gives the performance of a child on the Gesell 
schedule. The absolute scale values of the items are given in column 
one; the child’s responses to the items are given in column two. In 
the third column are indicated the two basic values $X_a$ and $X_b$, and the 
approximate locations of the midpoint M, and the individual’s score S. 
Step 1.—Determine the midpoint of the range of inaccuracy. 
Find the scale value of that item below which all items are passed, 
$X_a$, and the scale value of that item above which all items are failed, 
$X_b$. The midpoint is half of the sum of these two values, $\frac{X_a + X_b}{2}$. 
In the example, Table I, the value of $X_a$ is 15.14 and the value of $X_b$ 
21.82. The midpoint is then 18.48. 


Step 2.—Add the scale values of all items failed below the midpoint 
and divide this sum by two times the number of these items, $\frac{\Sigma y}{2n_y}$. 
In the example there are seven of these items with a total scale value of 
113.53; 

$$\frac{113.53}{14} = 8.109.$$

Step 3.—Add the scale values of all items passed above the midpoint 
and divide this sum by two times the number of these items, $\frac{\Sigma x}{2n_x}$. 
There are four of these items with a total scale value of 80.83; 

$$\frac{80.83}{8} = 10.104.$$


Step 4.—The individual’s score is obtained by adding the values 
from Steps 2 and 3. 

$$S = \frac{\Sigma y}{2n_y} + \frac{\Sigma x}{2n_x} = 8.109 + 10.104 = 18.213$$

It occasionally happens that an item has the same value as the 
midpoint of the range of inaccuracy. In this instance its value is 
entered in the summations. If the item is failed it is counted in with 
the items failed below the midpoint. If the item is passed its value 
is summed with the values of the items passed above the midpoint. 
Omitted items are not involved in computing the score value. 
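The four steps, together with the rules for tied and omitted items, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `item_value_score`, the "R"/"W"/None coding of responses, and the small example data are hypothetical and are not taken from Table I. The two basic values are passed in by the caller rather than derived, so the sketch does not commit to one way of locating them.

```python
def item_value_score(items, x_a, x_b):
    """Item value score from (scale_value, response) pairs.

    response is "R" (right), "W" (wrong), or None (omitted).
    x_a and x_b are the two basic values, supplied by the caller.
    """
    midpoint = (x_a + x_b) / 2  # Step 1
    # An item lying exactly on the midpoint is grouped by its outcome;
    # omitted items (None) match neither test and are ignored.
    passed_above = [v for v, r in items if r == "R" and v >= midpoint]
    failed_below = [v for v, r in items if r == "W" and v <= midpoint]
    if not passed_above or not failed_below:
        raise ValueError("need a pass above and a failure below the midpoint")
    step_2 = sum(failed_below) / (2 * len(failed_below))
    step_3 = sum(passed_above) / (2 * len(passed_above))
    return step_2 + step_3  # Step 4


# Hypothetical responses, invented for illustration (not the Table I data).
example_items = [
    (5.0, "R"), (6.0, "W"), (6.5, "R"), (7.0, "W"),
    (7.5, "R"), (8.0, "R"), (8.5, None), (9.0, "W"),
]
example_score = item_value_score(example_items, x_a=6.0, x_b=8.0)

# Arithmetic check using only the totals quoted in Steps 2 and 3.
table_one_score = 113.53 / (2 * 7) + 80.83 / (2 * 4)
print(example_score, round(table_one_score, 3))
```

In the hypothetical example the failed item at 7.0 coincides with the midpoint and is therefore counted with the failures below it, while the omitted item at 8.5 plays no part.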

In order to test the effect of using the actual scale values of the 
items in computing an individual’s score the performances of a number 
of children on the Gesell Developmental Schedule and the California 
First Year Mental Scale were scored by the item value procedure 
and by the Thurstone procedure, which utilizes only the number of 
items passed or failed. In those cases in which the performance 
became consistently worse from the beginning to the end of the range 
of inaccuracy, so there were no extremely low failures followed by con- 
sistent passing, or extremely high passes preceded by consistent 
failure, the two methods gave comparable scores. Case I, in the 
diagram, illustrates this type of performance. Two features of the 
performance should be noted. First, beginning with the item of 
the scale value 6.22 the proportion of failures gradually increases until 
from the value 7.49 all items are failed. Secondly, there are as many 
items failed below the score value as there are items passed above it. 
It is apparent that in this type of performance the deviations of the 
passes above the midpoint are about equal to the deviations of the 
failures below it, and further, that the number of items passed above 
the midpoint is equal to the number of failures below it. Under these 
conditions the two procedures should give approximately the same 
score. The score from the item value procedure was 6.82 and from the 
Thurstone procedure 6.87. The difference is extremely small, being 
less than .17 of one month of mental age. 

If all test performance were of this type there would be little reason 
for considering the scale values of the items in computing an individual’s 
score. But in actual experience it is found that the relation 
between the passes and failures is an extremely variable one and only 
occasionally does the performance in the range of inaccuracy approach 
the conditions seen in Case I. In other types of performance the 
proportion of the items failed does not progressively increase in the 
range of inaccuracy. There may be an extremely low failure followed 
by consistent passing. There may be an extremely high pass preceded 
by consistent failure. Different individuals, performing in the same 
range of inaccuracy, may pass the same number of items and fail the 
same number of items and yet because of differences in the scale values 
of the items passed and failed show considerable difference in their 
performance. Unless some cognizance is taken of the scale values, the 
individuals may be given exactly the same score. 
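The point can be made concrete with a short Python sketch. Everything in it is hypothetical: the fourteen scale values and the two response patterns are invented for illustration and are not the data behind the diagram, and the helper `basic_values` assumes one particular reading of the two basic values (the lowest failed value and the highest passed value).

```python
def basic_values(items):
    # Assumed reading: X_a is the lowest failed scale value and
    # X_b the highest passed scale value.
    x_a = min(v for v, r in items if r == "W")
    x_b = max(v for v, r in items if r == "R")
    return x_a, x_b


def item_value_score(items):
    x_a, x_b = basic_values(items)
    if x_a > x_b:
        raise ValueError("no range of inaccuracy in this performance")
    midpoint = (x_a + x_b) / 2
    passed_above = [v for v, r in items if r == "R" and v >= midpoint]
    failed_below = [v for v, r in items if r == "W" and v <= midpoint]
    return (sum(passed_above) / (2 * len(passed_above))
            + sum(failed_below) / (2 * len(failed_below)))


def tally(items):
    # Number of items passed and number failed, ignoring scale values.
    passes = sum(1 for _, r in items if r == "R")
    failures = sum(1 for _, r in items if r == "W")
    return passes, failures


# Invented scale values and responses (R = right, W = wrong).
SCALE = [5.0, 5.3, 5.6, 5.9, 6.2, 6.5, 6.8, 7.0, 7.3, 7.6, 7.9, 8.2, 8.6, 8.9]
child_a = list(zip(SCALE, "RRWWWRWRRRRWRW"))
child_b = list(zip(SCALE, "RRWRRWWWWRRRRW"))

for name, child in (("A", child_a), ("B", child_b)):
    print(name, basic_values(child), tally(child), item_value_score(child))
```

Both invented children share the same basic values and the same number of passes and failures, so any score built on counts alone cannot separate them; the item value score differs because the second child's passes and failures fall at higher scale values.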

A comparison of the two scoring procedures in Cases II and III 
(see diagram) illustrates how the procedures operate differently when 
the performance within the range of inaccuracy is an irregular one. 
Case II represents the performance of a nine-month-old child on the 
California scale. By the Thurstone method the individual is given 
a score of 2.80, by the item value method a score of 2.59. An examina- 
tion of the performance within the range of inaccuracy reveals three 
failures below the midpoint, one being an extremely low one, which, 
because of its low value, should be given considerable weight in deter- 
mining the individual’s score. If each item contributes according to 
its value, the effect of an item failed below the midpoint will be greater 
as the item deviates further from the point of fifty per cent accuracy. 
The relative effect of the failure of the item at 2.20 on the score as 
obtained by the two procedures is one reason for the score being lower 
when computed by the item value method. 

Case III illustrates another irregular performance. There are two 
extremely high passes, preceded by a range of consistent failure. The 
score from the Thurstone procedure is 3.06 and from the item value 
procedure 3.69. The difference in these two scores is in large measure 
due to the relative effect of these two high passes on the score 
as obtained by the two methods. When the actual value of the items 
is considered as in the item value method, the items at 4.19 and 4.39, 
having high values, tend to make the score higher. 

As two cases with different performances within the same range 
of inaccuracy were not found in either the Gesell or California data, 
Cases IV and V were especially constructed to illustrate the differences 
of the two procedures under this condition. These cases represent the 
responses of two seven-year-old children on a hypothetical scale. From 
the diagram it will be noticed that the range of inaccuracy is the same 
for the two cases, being from 5.6 to 8.6. Further, each individual 
passes the same number of items and fails the same number of items. 
If we compare only the failures below the midpoint or the passes above 
the midpoint, the performances are again alike, each having four 
failures and four passes, respectively. The responses of the two 
individuals are alike in all respects but one, and this is that the items 
passed and failed are not identical in the two cases. The items failed 
below the midpoint by Case IV tend to have lower scale values than 
those of Case V. Likewise, the items passed above the midpoint by 
Case IV tend to have lower scale values than those of Case V. From 
an examination of these two performances one would very probably 
conclude that the performance of Case V is slightly better than that of 
Case IV. The item value procedure gives Case IV a score of 6.95 
and Case V a score of 7.25. The difference of .30 is over one-fourth 
of a year, being 3.6 months. When scored by the Thurstone procedure 
each individual receives a score of 7.1. Assuming that the perform- 
ances of the individuals as indicated are representative of their 
ability, a discrepancy of 3.6 months directly attributable to the 
method of scoring is of considerable significance. Although this con- 
dition of supposed equivalence represents a hypothetical situation, the 
illustration demonstrates very clearly how the finer differences in 
performance escape detection when the actual scale values of indi- 
vidual items are ignored in the computation of an individual’s score. 
A careful evaluation of the item value scoring procedure* indicates 
that it satisfactorily meets the requirements suggested by Thurstone.¹ 





* A comprehensive study is now in progress in which the item value method is 
being compared with other available scoring procedures. An individual analysis 
of a large number of cases is being made. 
TABLE I.—PERFORMANCE OF CHILD N. K. ON THE GESELL SCHEDULE 

Absolute scale value of test items in months | Child’s responses, R = right, W = wrong | X_a and X_b = basic scores, S = score, M = midpoint 
13.22 | | 
13.22 | | 
13.22 | | 
15.14 | | X_a 
15.99 | | 
15.99 | | 
15.99 | | 
15.99 | | 
15.99 | | 
15.99 | | 
17.09 | | 
17.09 | | 
17.09 | | 
17.59 | | 
 | | 18.21 S 
 | | 18.48 M 
 | | 21.82 X_b 

1. It is not necessary to have the same number of test items in 
each step of the scale. For example, in such a test as the Binet, it 
would not be necessary to have the same number of items at each age 
level. In the interest of differentiating capacity the items should be 
rather evenly spaced throughout the entire range of the test. There 
should be no piling up of items near any given scale value, or long gaps 
on the scale where there are no items. The greater the number of 
items the greater the differentiating power of the test. 

2. It is possible to omit test items at different levels of the scale, as 
the computation of the individual’s score in no way is dependent upon 
omitted items. 

3. The scoring procedure can be applied to either (1) test items 
that are scored right or wrong by the all-or-none principle, or (2) 
items which give a variable score, such as time per given amount of 
work, errors per given amount of work or amount of work per given 
time period, providing the possible variations in the score have been 
previously scaled and given a value on the continuum of ability being 
measured. 

4. It is not necessary to submit every subject to the whole range 
of the test. The two basic scores, $X_a$ and $X_b$, mark off the range of 
inaccuracy. The individual should be tested sufficiently beyond this 
range, in both directions, to make certain that he can pass all items 
below $X_a$ and that he would fail all items beyond $X_b$. 

5. By means of the method a rational score is obtained which can 
be used in comparing the performance of individuals or the performance 
of groups of individuals. As all scores are expressed in terms of the 
same units and referenced to the same zero point, they are subject to 
algebraic manipulation. Group averages computed from such scores 
are more stable than those determined from the scores of unscaled 
tests. 

6. The arithmetic computations in determining individual scores 
are not difficult and the time involved is no greater than that demanded 
by other similar scoring devices. Considerable time can be saved by 
using a simple scoring form. 

7. The procedure is consistent with psychophysical methods. 
Items passed contribute according to their scale values, and undergo 
no translation into mental age units. The procedure can then be 
considered free from the logical errors involved in the mental age 
scoring procedures. Should the items be scaled in terms of age units 
a score can be translated into an approximate mental age equivalent. 

A PERSISTENT ERROR IN THE NATURE-NURTURE 
CONTROVERSY 


STUART M. STOKE 
Mount Holyoke College 


There is no statute of limitations on the life of an error in the 
scientific world. Apparently it must continue until it can be displaced 
by criticism or by discovery of the error through more careful and 
accurate research. Unfortunately a clearing up of the error in the 
minds of a few experts does not terminate the harm, for usually a 
number of years are required to eradicate the false information from 
texts and popular writings. Until such an eradication is made, the 
old error continues to be disseminated to students and lay readers. 
Unfortunately such individuals rarely re-examine the concepts which 
lie outside their own fields of activity. Consequently they may carry, 
for the remainder of their lives, misconceptions which were honestly 
gained from books which spoke with authority and should have been 
more reliable sources. These individuals often act unquestioningly 
upon the basis of false information thus gained, and so continue the 
harm done. In such ways is error perpetuated long after the fallacy 
has been found. Part of the blame must rest upon the writers of texts 
for accepting the statements of previous texts too uncritically and, in 
some instances, for inadequate reading. However, part of the difficulty 
is due to the fact that criticisms of research generally attract less notice 
than did the original studies which are being criticised. Such criti- 
cisms are frequently written in technical fashion, deal with difficulties 
which do not interest the lay reader and are often intended to appeal 
only to technologists. In other cases criticism fails because it is given 
dogmatically, or is in bad taste, or logic alone is used to support the 
criticism. Consequently published criticisms are apt to accumulate 
more dust than recognition, and are over long in reaching the public. 

An illustration of this can be seen in one phase of the nature- 
nurture controversy. This particular error has been given wide 
publicity for a number of years. Of the ten educational psychologies 
of recent vintage which I find on my shelves at the moment, eight 
quote, with approval, an error which was committed twenty 
years ago by one side of the controversy and one quotes a recent commission 
of the same error by the other side of this perennial hostility 
(with opposite findings, of course). The tenth text ignores the problem 
and thus avoids the error, without, however, clearing up the 
difficulty. It would be unfair to leave the impression that only writers 
of educational psychologies have quoted these errors with approval 
and without question. Writers in the fields of genetics and sociology 
have likewise accepted them. Here and there an occasional critic has 
voiced disapproval of the research methods involved and the findings 
obtained, but their criticisms have attracted little attention, for texts 
published within a year are still quoting findings made by the erroneous 
methods which are hereafter described. 

In brief the error against which this paper is directed consists of 
diagnosing one generation as feebleminded by one criterion and a 
later generation as feebleminded, or normal, by another, and then 
assuming that heredity produced the similarities found, or environ- 
ment caused the differences observed. Each party to the time-honored 
dispute of nature vs. nurture has used the method to its own advantage, 
although the hereditarians have used it more than the environmental- 
ists and their findings have had far more publicity. In neither case, 
however, is the method justifiable. 

Unfortunately one cannot be critical without being specific and 
therefore it will be necessary to examine a study on each side of the 
nature-nurture controversy and show how the errors committed 
prevent any conclusions either for or against the inheritance of feeblemindedness. 
The better known of the two studies selected for criticism 
is the “Kallikak Family” by H. H. Goddard. The second is 
more recent and forms only a minor part of an otherwise able piece of 
research work, “The Influence of Environment on the Intelligence, 
School Achievement and Conduct of Foster Children” by F. N. 
Freeman, K. J. Holzinger and B. C. Mitchell, published in the 
“Twenty-seventh Year Book of the National Society for the Study of 
Education.” A brief examination of the findings of the second, and 
less publicized, study will be made before a criticism of the method is 
taken up. 

Freeman and his co-workers found, in their investigation of foster 
children, a group of twenty-six children, all of whose parents were said 
to be feebleminded. These children were adopted into better homes 
and upon later testing were found to be superior to the alleged mental 
status of their parents. The deductions made are as follows. 

“If feeblemindedness is to be regarded as a recessive character, 
the offspring of two feebleminded parents would all be feebleminded 
according to the Mendelian law. In the above group, however, only 
four of the twenty-six children were found to be subnormal and two 
of these were members of the same family. It should also be noted 
that these four children were over 6.5 years when committed and 
tested only slightly below seventy . . . Although the ratings of the 
intellectual levels of the own parents are, in the majority of cases, only 
the estimates of the society investigators, it is probable that these 
parents are correctly classified as feebleminded. The findings on this 
group, therefore, appear to indicate that feeblemindedness is not to be 
regarded as a unit character which is inherited in accordance with the 
Mendelian law. On the contrary it is a trait which is subject to the 
modifying influence of environment.”¹ 

If this finding can be accepted as valid, it holds tremendous 
significance for education and society. The solution of the problem 
of the feebleminded, or at least the moron group, would consist merely 
of providing proper environmental conditions for unfortunates during 
their early childhood. But the findings are a bit too startling. The 
sceptic wants to know how literally he can believe the following 
extract from the previous quotation: “Although the ratings of the 
intellectual level of the own parents are, in the majority of cases, only 
the estimates of the Society investigators, it is probable that these 
parents in question are correctly classified as feebleminded.” Dr. 
Freeman, in response to a question put in a public meeting, stated 
that the methods used in deciding whether these parents were feeble- 
minded or not were analogous to those used by Goddard. Our prob- 
lem, then, is to discover whether these methods are adequate for 
establishing the intellectual level of the parents, i.e., whether these 
methods yield results with the parents which are comparable to the 
results obtained with their children by the use of intelligence tests. 
Clearly if the methods do not produce comparable results, then it is 
invalid to infer that environment had anything to do with the assumed 
improvement of the foster children over their parents. There may 
have been no intellectual improvement at all. 

1 “Twenty-seventh Year Book of the National Society for the Study of Education,” Part I, pp. 167. 

To settle this problem demands an investigation of Goddard’s 
methods. In the story of the Kallikak Family, he has related the 
finding of Deborah, a girl in an institution for the feebleminded. A 
mental test showed her to be feebleminded. It was desired to trace her 
ancestry to discover whether her condition was hereditary. Accordingly 
it was traced back to an illegitimate union of a soldier, Martin 
Kallikak, with a feebleminded girl in a tavern. Between the girl in 
the institution and the girl in the tavern stretch five generations and 
a bit over one hundred years. Included in these five generations, 
with their collateral branches, were four hundred eighty descendents 
of the illegitimate union. “One hundred forty-three of these, we 
have conclusive proof, were or are feebleminded, while only forty-six 
have been found normal. The rest are unknown or doubtful.”¹ Now 
just how did Goddard obtain his “conclusive proof” that these individuals 
were feebleminded? Most of the cases were not given intelligence 
tests, for, of course, many were dead and many others were seen 
only in visits. It will be necessary to examine the methods in some 
detail. 

Three methods were used wherever it was not possible to apply 
intelligence tests, and, of course, it was not possible to apply tests to 
many of the ancestry nor even to many of the living descendents. One 
of these methods may best be learned by quoting from the observa- 
tions made by the field worker in visiting the homes of the Kallikaks. 

“The girl of twelve should have been at school, according to the 
law, but when one saw her face,² one realized that it made no difference. 
She was pretty, with olive complexion and dark, languid eyes, but there 
was no mind there.”* “The boy with her wore an old suit that 
evidently was made to do service by night as well as by day. A glance 
sufficed to establish his mentality,² which was low.”* “The father himself, 
though strong and vigorous, showed by his face² that he had only 
a child’s mentality.”* “She appeared² to be criminalistic, or at least 
capable of developing along that line.” 

Such methods of character and mind reading have long been discarded 
by scientific psychologists. No reputable psychologist presumes 
to make a diagnosis of feeblemindedness “at a glance,” or to 
infer criminalistic tendencies in embryo from “appearance.” Why 
then should such evidence be accepted from relatively untrained field 
workers and made the basis of sweeping conclusions concerning the 
heritability of feeblemindedness? 

1 Goddard, H. H.: “The Kallikak Family,” Macmillan Co., 1912, pp. 18. 
2 Italics ours. 
3 Goddard, H. H.: “The Kallikak Family,” Macmillan Co., 1912, pp. 72-73. 
4 Ibid., pp. 78. 
5 Ibid., pp. 87. 

Another of Goddard’s methods which was used in the cases of the 
deceased was to have recourse to original documents whenever these 
were available. Such documents, he admits, were few. However, 
this scarcity was considered almost as damaging evidence as the 
presence of such documents. “For instance the absence of a record 
of marriage is often quite as significant as its presence.”¹ Just how 
the absence of a marriage record can be used as a substitute for an 
intelligence test is a bit beyond this critic, so further comment will be 
withheld. 

A third, and important, means of identifying individuals as feebleminded 
(when they were dead or would not submit to tests, or against 
whom there were no incriminating documents, either absent or present) 
was to ask the neighbors. “Some record or memory is generally 
obtainable of how the person lived, how he conducted himself, whether 
he was able to make a living, how he brought up his children, what was 
his reputation in the community; these facts are frequently sufficient 
to enable one to determine with a high degree of accuracy² whether the 
individual was normal or otherwise.” The memories concerning the 
earlier generations of the Kallikak Family were supplied by the 
reminiscences of elderly people. Their value can best be judged by 
some quotations. 

“Did you ever see the mother of old Martin?”³ This question 
was put by the field worker to an old farmer who had just declared 
that little had taken place in that neighborhood during the last seventy 
years in which he had not had a part. The “old Martin” referred to 
was the son of the tavern maid and was born during the Revolutionary 
War. The response elicited was: “No, she was dead before my time, 
but I have heard the folks talk about her . . . Dear me! it’s been so 
long since I’ve thought of these people that many times I forget, but 
it would all come back to me in time.”³ Such memories have been 
shown by psychologists to have little value as evidence in courts of 
law; why, then, should they be accepted in the courts of science? 

1 Ibid., pp. 14. 
2 Italics ours. 
3 Ibid., pp. 85. 

The talk of the village gossips is seldom of such a nature as to give 
an unbiased picture which a dispassionate scientist might use. Instead 
it is the sort that makes the evil one does live after him while the good 
is oft interred with his bones. The gossip about the Kallikaks is no 
exception to the rule. Novices in the field of feeblemindedness are 
apt to place undue weight upon moral laxity as evidence of subnormality 
of intellect. Alcoholism, sexual promiscuity, thievery, etc., are 
accepted as evidence against the mentality of the individual. We 
find that this was a strong element in the identification of the members 
of past generations of the Kallikaks as feebleminded. For example, 
Martin Jr. is labelled as feebleminded but the evidence brought against 
him consists of these accusations: (1) That he is the ancestor of some 
individuals (several generations removed) who were diagnosed as 
feebleminded on the basis of intelligence tests; (2) that many of his 
descendents were social liabilities to the community and were morally 
of low standards; (3) and that he himself bore none too savoury a 
reputation. An old lady’s comment reports that “he was always 
unwashed and drunk. At election time, he never failed to appear in 
somebody’s cast-off clothing, ready to vote, for the price of a drink, 
the donor’s ticket.”¹ The old farmer, already referred to, added to 
the picture by saying in response to a question as to whether he remembered 
Martin Jr.: “Do I? Well, I guess! Nobody’d forget him. 
Simple. Not quite right here” (tapping his head) “but inoffensive and 
kind. All the family was that . . . That was the worst of them, they 
would drink . . . Old Martin could never stop as long as he had a 
drop. Many’s the time he’s rolled off of Billy Parson’s porch. Billy 
always had a barrel of cider handy. He’d just chuckle to see old Martin 
drink and drink until he’d finally lose his balance and over he’d 
go.”² Comment concerning Billy’s mental status was withheld. Presumably 
he is normal in spite of his low sense of humor. However, 
Martin Jr. is, by this evidence, labeled “feebleminded” and exactly 
the same term is applied to his great-great-grand-daughter on the basis 
of an intelligence test. Does the term mean the same thing in each 
case? Possibly, but certainly there is no proof that it does. The 
reader of this account of the Kallikak Family cannot doubt that the 
social worker discovered a history of social degeneracy and ineptitude. 
But there is grave reason to doubt the conclusion that this deplorable 
social condition was due to the biological inheritance of feeble- 
mindedness. 

1 Ibid., pp. 80. 
2 Ibid., pp. 83-84. 

In one family described by Goddard we find an exception to the 
general tale. “According to Mendelian expectation, all the children 
of Millard Kallikak and Althea Haight should have been feebleminded, 
because the parents were such. The facts, so far as known, 
confirm this expectation, with the exception of the fourth child, a 
daughter, who was taken into a good family and grew up apparently a 
normal woman.” Descendents from her for two generations have 
shown no sign of feeblemindedness, and her grandchildren are described 
as “normal and above average intelligence.” Perhaps the reader may 
suspect that she was an illegitimate child. Even if this is admitted, 
we still find it hard to reconcile this excellent family history with the 
sorry tale of the horde of feebleminded individuals who sprang from the 
union of a “normal” man with a “feebleminded” girl in a tavern in 
Revolutionary War times. In both cases we have the same alleged 
biological background (when we assume that the adopted girl was 
illegitimate) but in one instance, excellent environmental conditions 
prevailed, while in the other, the sordid environment which attended 
the illegitimate offspring of tavern maids existed. If heredity is so 
all-powerful, why should the descendents in one case appear normal 
and in the other, feebleminded? If we assume that the adopted girl 
was legitimate, then it is still more difficult to explain how heredity 
can be the powerful force it is claimed to be, for then we would have 
a normal strain springing from two feebleminded parents, and a sub- 
normal family strain arising from a union in which only one parent 
was supposed to be feebleminded. Freeman and his co-workers might 
be expected to back the proposition that an improved environment 
had improved the mentality of the strain. Whether that is a possibil- 
ity or not remains to be determined by experimental work of a more 
careful nature than is possible with these data. In the opinion of the 
writer, the issue of nature versus nurture, here, is rendered insoluble 
by the use of social criteria (which are influenced greatly by environ- 
ment) as a measure of a supposed biologically hereditary trait (which 
is relatively uninfluenced by environment). Consequently conclusions 
concerning heredity are pointless in this instance, and the validity of 
the Kallikak Family as a study of inheritance is weakened still further 
by this lack of internal consistency. 

1 Ibid., pp. 24. 
Italics ours. 
2 Davies, S. P.: “Social Control of the Mentally Deficient.” Crowell Co., 1930, pp. 153-154. 

That these methods have been criticized before is true. Davies² 
says: “Hearsay evidence had to be relied upon and one gains the 
impression that if this hearsay gave a picture of the individual as being 
shiftless, alcoholic, a ne’er-do-well or a criminal, the label of feeblemindedness 
was likely to be applied to him.” Danielson and Davenport, 
in a study somewhat similar to the Kallikak Family, but from 
which they drew somewhat more temperate conclusions, admit that 
“feeblemindedness is no elementary trait, but is a legal or sociological, 
rather than a biological term.”¹ Elsewhere they add: “The distinction 
between an ignorant person who has normal mental ability and a 
high-grade feebleminded one who has not, is often as impossible to make 
as that between medium and low-grade feeblemindedness.” In spite 
of this realization of their difficulties, they did go ahead to study the 
matings of people described as feebleminded in this dubious fashion 
and then drew conclusions as to the probability of the inheritance of 
feeblemindedness. Their conclusions were not quite orthodox and 
Holmes² rejects them as internally inconsistent and their data as of 
little value. Curiously enough, however, he accepts Goddard’s work 
although the same methods were used in both studies. One finds it 
difficult to understand why Holmes accepts the one and rejects the 
other unless it is because the conclusions of one fit into the pattern 
of biological theory better than those of the other. Ellis³ criticizes Goddard’s 
work at some length upon theoretical grounds which do not need to 
be repeated or examined here. Myerson⁴ has provided the most 
violent criticism which the writer has seen. 

"I confess to a feeling of shame in the presence of the field work
done in this case. I have had charge of a clinic where alleged feeble- 
minded persons were brought every day and I see in my practice and 
hospital work murderers, thieves, sex offenders, failures, etc. Many 
of these are brought to me by social workers, keen intelligent women, 
who are in grave doubt as to the mental condition of their charges 
after months of daily relationship, after intimate knowledge, and pro- 
longed effort to understand. Many a time it has happened that one 
of these excellent women has declared that her charge must be feeble- 
minded or insane, and yet the mental test and psychological examina- 
tions have shown the contrary, that the patient was of full average 
mentality or better; . . . And I have to say of myself, with due
humility, that I have had to reverse my first impressions many times."⁵





1 Danielson, F. H. and C. B. Davenport: The Hill Folk. Memoirs of the
Eugenics Record Office, vol. I, 1912, pp. 11 and 3.

2 Holmes, S. J.: "The Trend of the Race." Harcourt Brace and Co., 1921.

3 Ellis, R. S.: "The Psychology of Individual Differences." D. Appleton and
Co., 1928.

4 Myerson, Abraham: "The Inheritance of Mental Diseases." Williams and
Wilkins, Baltimore, 1925, p. 78.

5 Myerson, Abraham: "The Inheritance of Mental Diseases." Williams and
Wilkins, Baltimore, 1925, p. 78.

The criticisms which have come to the attention of the writer have
rested largely upon logic and statements of impressions gained from 
experience. In the present instance, an attempt will be made to sup- 
plement logic and experience with facts, and facts are stubborn critics. 
The essential question which this paper must answer before the case 
against the defendants is clinched, is whether individuals, who are not
feebleminded, are ever, or frequently, accused of being feebleminded on 
the basis of the sort of social and moral ineptitude exhibited by the 
Hill Folk, Kallikaks and others. If that question can be answered in 
the affirmative, then there remains no basis for discussing heredity 
when ancestors have been declared feebleminded by reason of their 
social difficulties and their descendants have been pronounced normal,
or feebleminded, on the basis of intelligence tests. 

The late Dr. W. E. Fernald provided an interesting contrast 
between two groups of individuals, one of which he calls "not
feebleminded" and the other feebleminded. The cases listed in the former
group were brought to Dr. Fernald's Clinic for examination on the
suspicion that they were feebleminded. Dr. Fernald states that
"their bad behavior was the principal reason why the members of this
group were brought to the clinic."¹

The table shown on p. 672, adapted from Dr. Fernald's work,² shows
interesting contrasts and similarities between the feebleminded and the
non-feebleminded.

The reader will note particularly that the percentage of individuals 
whose moral reactions were bad was approximately the same for each 
group. In addition there is a bad family history in fifty-six per cent of 
the cases of the not-feebleminded group. Furthermore the personal 
and developmental history is unsatisfactory in over half the group. 
Now this is just the sort of thing which caused these cases to be exam- 
ined. And it is also the sort of thing which the neighbors learn and 
pass on from one generation to another, causing later field workers to 
render a verdict of feebleminded when these tales are told. But 
obviously these social indications of feeblemindedness are unreliable 
indices of intellectual feeblemindedness. Now suppose that these 
cases had not been brought in for examination but had been allowed 
to continue their anti-social ways with only the records of the courts, 
the memories of their neighbors, and the judgments of field workers
standing against them. We might very well find a later investigator
declaring them feebleminded (just as Goddard, Danielson and Davenport,
Freeman and others did do) on the basis of their social record and
their children normal, or feebleminded, on the basis of intelligence
tests. Would that investigator have any basis for conclusions
concerning the inheritance of intelligence? The answer is obviously,
"No."

1 Fernald, W. E.: Standardized Fields of Inquiry for Clinical Studies of
Borderline Defectives. Mental Hygiene, Vol. i, No. 2, p. 6.

2 Ibid., taken in part from a chart and in part from the context.


DIFFERENCES BETWEEN FEEBLEMINDED AND NOT-FEEBLEMINDED INDIVIDUALS

                                        | Percentage scored | Percentage scored
Fields of inquiry                       | "minus" in the    | "minus" in the
                                        | not-feebleminded  | feebleminded
                                        | group             | group
Physical examination                    |        33         |        80
Family history                          |        56         |        72
Personal and developmental history      |        52         |        92
School progress                         |        33         |        94
Examination in school work              |        16         |        94
Practical knowledge and information     |         7         |        88
Social history and reactions            |        39         |        92
Economic efficiency                     |        30         |        85
Moral reactions                         |        70         |        72
Mental examination                      |        13         |        98

The question of how frequently this sort of faulty diagnosis might
be expected to occur is not so easy to answer. However, some clue
may be obtained by examining the report of an institution for the
feebleminded.¹ During the year, four hundred ninety-five patients
were examined in the out-patient clinic which was held once a week.
Of this number, 58.9 per cent were diagnosed as feebleminded. The
remaining forty-one per cent were scattered among the following
categories: Dull, borderline, normal, psychotic (deferred diagnosis,
eight cases; and undiagnosed, ten cases). In brief, forty-one per cent
of the patients bad enough to be misfits in a normal society and to be
suspected of being feebleminded, were actually not feebleminded. To
be sure many of them were defective in other ways, but there is no
more reason for assuming feeblemindedness on the basis of these
defects than there is for diagnosing pneumonia whenever any person
is ailing. Surely this evidence is adequate to substantiate the writer's
contention that faulty diagnosis of feeblemindedness on the sole basis
of social ineptitude occurs much too often to justify the use of such a
method in studying biological heredity.

1 Annual Report of the Trustees of the Walter E. Fernald State School at
Waltham, Mass., for the year ending Nov. 30, 1931, p. 16.

That the studies just criticized have no value, the writer would be 
the last to contend. They have demonstrated that social levels of 
behavior and competency do tend to persist over a number of gener- 
ations. They may add to our understanding of the nature-nurture 
problem as it concerns social ineptitude and adjustment. It is quite 
possible that the extension of such studies as the one made by Free- 
man and his co-workers may eventually show that the low social level 
of a family may be due in considerable measure to the persistence of 
unfortunate social habits and ways of thinking. Furthermore, they may
show that as long as children are reared in such folk-ways, there is
little hope of their escape into more socially desirable ways of living;
but that, if given a chance to grow up in a normal environment, they may
become almost, or entirely, indistinguishable from the great mass of
ordinary people.
However, the problem of biological heredity is much too involved to 
be settled in any such fashion or by such faulty techniques, and it is 
to be hoped that the perpetuation of this error may soon cease. 





























THE CHANCE ELEMENT IN MATCHING TESTS 


JOSEPH ZUBIN 
Teachers College, Columbia University 


Chance plays a predictable role in the score that an individual
obtains in objective tests. The importance of the chance element 
varies, however, with the type of test. In the Recall Test chance plays 
a relatively unimportant part, while in the Multiple Choice and in the 
True and False Tests chance plays an increasingly important part. 
The influence of chance upon score is, of course, of no consequence 
when scores on the same type of test are compared with each other. 
However, when the "true" score (the score free of chance) is desired,
or when comparisons are made between scores obtained on different
types of tests, the need for correcting for the influence of chance upon
score is quite important. Several formulae have been proposed for
the correction for chance in Multiple Choice tests and in True and False
Tests. This paper deals with the development of correction formulae
for the Matching Test. These formulae differ from those that have
been proposed for the type of Matching Test that is known as the
Continuity Test.¹ In the latter test the subject's task is to rank
the items in order of time, importance or some other continuous
quality. In the matching test that is to be considered in this paper,
there is no interdependence of this kind between the items.

A matching test may be regarded as consisting of the task of matching
items a, b, c, d . . . n with their apposite expressions, the numerals
1, 2, 3, 4 . . . n, where the correct matching consists of associating
item a with its apposite, 1, item b with its apposite, 2, and so on. We
shall investigate first the case where none of the individuals have any
knowledge about the correct association between the items and their
apposites, and, therefore, match by sheer chance.²

When the matching test consists of two items, there are only two
possible ways of performing the matching process. If N = number of
individuals taking the test, half of them will match a with 1 and
perforce b with 2, and the other half will match a with 2 and perforce
b with 1. The score of the first set of individuals will be 2 and of the
last set will be zero. The mean will be

$$M = \frac{\frac{N}{2}\cdot 2 + \frac{N}{2}\cdot 0}{N} = 1,$$

and the standard deviation will also be 1, since

$$\sigma = \sqrt{\frac{\frac{N}{2}(2-1)^2 + \frac{N}{2}(0-1)^2}{N}} = 1.$$

1 See discussion of Continuity Tests in Cureton, E. E. and J. W. Dunlap:
Scoring the Rearrangement or Continuity Test. School Review, vol. XXXVIII,
Oct., 1930, pp. 613-616.

2 There are certain psychological reasons for believing that random matching
in the mathematical sense rarely occurs. Primacy and propinquity probably
affect the matching process. These psychological factors are not considered
in this paper.

When the matching test consists of three items (n = 3), the following
table may be drawn up:

TABLE I

  Apposite 1 | Apposite 2 | Apposite 3 | Score | fx | fx²
       a     |      b     |      c     |   3   |  3 |  9
       a     |      c     |      b     |   1   |  1 |  1
       b     |      a     |      c     |   1   |  1 |  1
       b     |      c     |      a     |   0   |  0 |  0
       c     |      a     |      b     |   0   |  0 |  0
       c     |      b     |      a     |   1   |  1 |  1
  Total number of possibilities: 6     |       |  6 | 12

Each of the six permutations is equally likely by chance, and the
mean and σ are

$$M = \frac{6}{6} = 1, \qquad \sigma = \tfrac{1}{6}\sqrt{6(12) - 36} = 1.$$
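
The enumeration in Table I can be repeated mechanically for small n. The following is only an illustrative sketch, not part of the original paper; the function name `chance_scores` is hypothetical. It lists every possible matching, scores each by the number of items that land on their own apposite, and reports the mean and standard deviation of those scores.

```python
from itertools import permutations
from statistics import mean, pstdev


def chance_scores(n):
    """Return the score (number of correct matches) of each of the n! matchings."""
    return [
        sum(1 for item, apposite in enumerate(p) if item == apposite)
        for p in permutations(range(n))
    ]


# Mean and population standard deviation of chance scores for n = 2 .. 6.
for n in range(2, 7):
    scores = chance_scores(n)
    print(n, len(scores), mean(scores), pstdev(scores))
```

For n = 3 the sorted scores reproduce the Score column of Table I.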


The mean is again found to be 1 and the sigma also 1. The general
equation for the mean and the sigma may be derived as follows:¹ If
the test consists of the task of matching items a, b, c . . . n with the
numerals 1, 2, 3, . . . n, there are factorial n different ways of
completing this matching task. That is, there are factorial n different
patterns of responses to the n items. Each one of these factorial n
patterns or sets is equally likely to occur by chance and will have a
score depending upon the number of items that are matched with their
correct apposite. The score will be (n - r) where r is the number of
displacements, that is, the number of items which are not matched
with their proper apposite. In order to obtain expressions for the
mean and standard deviation (of chance matching), it is necessary to
find the general expression for the frequency of each score.

1 The author is indebted to Dr. Helen M. Walker and to Professor H. Hotelling
for their very kind helpfulness.

The probability or frequency of any score, (n - r), is dependent
upon the number of possible different ways that a score of (n - r)
can be obtained from n items. That is, the frequency of (n - r)
is equal to the number of sets having (n - r) items matched correctly
or r items "mismatched." It is somewhat simpler to deal with the
latter mathematically and we shall confine our attention to determining
$f_r$, the frequency of sets having r displacements or mismatchings.

Generally, r items can be selected from a total of n in $C_r^n$ ways.
In each set containing r displacements, the displaced items can be
arranged in $K_r$ ways and still retain r displacements. Hence, the total
number of sets having r displacements is $K_r \cdot C_r^n$. The value of $K_r$
is of course dependent only upon r and can be found empirically as
follows: When r = 0, K = 1 since $f_0 = K_0 C_0^n = 1$. When r = 1,
K = 0 since there is no way of mismatching one item with its apposite.
When r = 2, K = 1 since there is only one way of arranging a and b so
as to be mismatched with 1 and 2, and that is $a_2 b_1$. When r = 3,
K = 2 since a, b, and c can be mismatched in two different ways with
1, 2 and 3 as follows: $a_2 b_3 c_1$ and $a_3 b_1 c_2$. Any other arrangement will
not yield three displacements. When r = 4, K = 9. For, there are
three ways of mismatching the letters with the numerals so that
a remains mismatched with the same numeral, 2:

$$a_2 b_1 c_4 d_3 \qquad a_2 b_3 c_4 d_1 \qquad a_2 b_4 c_1 d_3$$

Now retaining the letters in the same order, 2 can change places with
the numerals associated with c or d but not with the numeral of b
(for then, the number of displacements would be reduced by 1).
Hence there are two more groups of three sets each similar to the above
or a total of nine different ways of mismatching the four letters with
the numerals.

In a similar manner the value of $K_5$ is found to be forty-four and
$K_6 = 265$. The value of any K, $K_n$, may be found by noting that the
sum of the frequencies of all the different possible scores is n! and¹

$$n! = K_0 C_0^n + K_1 C_1^n + K_2 C_2^n + \cdots + K_r C_r^n + \cdots + K_n C_n^n. \tag{I}$$

1 Equations (I) and (II) are well known in mathematical literature and are
given by P. A. Macmahon in "Combinatory Analysis," Cambridge, 1915, Section

After letting $K_n$ have the value of the K which is being sought, the
equation can be solved for $K_n$ and its value determined in terms of the
previous K's.
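
The procedure of solving equation (I) for the last K in terms of the previous ones may be sketched as follows. This is an illustration only, not code from the paper; the function name `displacement_counts` is an assumption, and `math.comb(n, r)` stands for the paper's $C_r^n$.

```python
from math import comb, factorial


def displacement_counts(n_max):
    """Return [K_0, ..., K_n_max], each found by solving equation (I) for K_n."""
    K = []
    for n in range(n_max + 1):
        # Sum of the terms K_r * C(n, r) for the already known r < n.
        known = sum(K[r] * comb(n, r) for r in range(n))
        # C(n, n) = 1, so K_n is whatever remains of n!.
        K.append(factorial(n) - known)
    return K


K = displacement_counts(6)
print(K)
```

The values obtained for r = 0 to 6 can be compared with those quoted in the text (1, 0, 1, 2, 9, 44 and 265).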

In order to obtain the value for the mean score of this distribution
of random matchings, each frequency is multiplied by its corresponding
score and the result is as follows:

$$n!\,\bar{x} = K_0 C_0^n (n-0) + K_1 C_1^n (n-1) + \cdots + K_r C_r^n (n-r) + \cdots + K_n C_n^n (n-n). \tag{II}$$

By means of equation (I) it can be shown that

$$K_r = r!\left[1 - \frac{1}{1!} + \frac{1}{2!} - \frac{1}{3!} + \cdots + \frac{(-1)^r}{r!}\right] = r!\sum_{i=0}^{r}\frac{(-1)^i}{i!}.$$

Hence

$$n!\,\bar{x} = \sum_{r=0}^{n} K_r C_r^n (n-r) = \sum_{r=0}^{n-1}\frac{n!}{(n-r-1)!}\sum_{i=0}^{r}\frac{(-1)^i}{i!}$$

or

$$\bar{x} = \sum_{r=0}^{n-1}\frac{1}{(n-r-1)!}\sum_{i=0}^{r}\frac{(-1)^i}{i!}.$$

By direct substitution for various values of n it can be shown that
$\bar{x}$ is equal to unity. Similarly,

$$\overline{x^2} = \frac{\sum f_r (n-r)^2}{n!} = \sum_{r=0}^{n-1}\frac{(n-r)}{(n-r-1)!}\sum_{i=0}^{r}\frac{(-1)^i}{i!}$$

is found to be equal to 2 for the various values of n that have been
tried out. The standard deviation, whose square is equal to

$$\sigma^2 = \overline{x^2} - (\bar{x})^2 = 2 - 1 = 1,$$

is thus found to be unity also. Hence both the mean and the sigma
of random matching are equal to 1.
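
The "direct substitution" mentioned above can be carried out exactly with rational arithmetic. The sketch below is an illustration under the closed form for $K_r$ given in the text; the names `K` and `chance_moments` are hypothetical. It evaluates the mean and the mean square of the chance score for n = 2 to 10.

```python
from fractions import Fraction
from math import comb, factorial


def K(r):
    """Closed form K_r = r! * sum_{i=0}^{r} (-1)^i / i!, kept exact."""
    return factorial(r) * sum(Fraction((-1) ** i, factorial(i)) for i in range(r + 1))


def chance_moments(n):
    """Exact mean and mean square of the chance score (n - r) over all n! matchings."""
    total = factorial(n)
    mean = sum(K(r) * comb(n, r) * (n - r) for r in range(n + 1)) / total
    mean_sq = sum(K(r) * comb(n, r) * (n - r) ** 2 for r in range(n + 1)) / total
    return mean, mean_sq


for n in range(2, 11):
    m, m2 = chance_moments(n)
    print(n, m, m2, m2 - m * m)  # mean, mean square, variance
```

One caveat that this check brings out: the variance of unity holds for n of 2 or more. With a single item there is only one possible matching, so the score is always 1 and the variance is zero.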





III, Chapter III—The Theory of Displacements, and in an article by the same 
author in the Transactions of the Cambridge Philosophical Society, Vol. XXI, 
No. XVIII—The Problem of Derangement in the Theory of Permutations. In 
Macmahon’s symbolism K, = {0; /*}. 
We may now proceed to determine the relationship between the
true score on a matching test (the score earned by dint of knowledge
of the correct association between the items and their apposites) and
the obtained score resulting from a combination of the former and the
influence of chance matching.

If a sufficiently large population is taken, we can regard the
population as made up of separate subgroups according to the number of
items that they really knew. The true score of each subgroup will be
(n - r) where r is the number of items that they did not actually know
and therefore matched by sheer chance. The standard deviation of
the true scores for any one of these subgroups will of course be zero.
Each subgroup except the one that knew all the items will increase its
score through chance. The amount of increase can be determined as
follows: If a group knew n - r items, it responded by chance to r
items. It has been shown above that when r items are matched by
chance (regardless of the value of r) the expected value of the mean is
unity and the expected value of the standard deviation is also unity.
Hence the obtained mean of each of the subgroups will be n - r + 1
and the obtained standard deviation will be unity. In order to obtain
the mean of the entire population from the mean of the subgroups, we
proceed as follows:

Let $p_0$ = number of individuals who did not know any of the items
    $p_1$ = number of individuals who knew one item
    $p_n$ = number of individuals who knew all n items

If a score of 1 is given for every item matched correctly, the true
score $x_t$ of the various groups is

$$x_0 = 0;\; x_1 = 1;\; x_2 = 2;\; \ldots;\; x_n = n; \quad\text{and}\quad \bar{x} = \frac{\sum p x}{N} = \frac{p_0 x_0 + p_1 x_1 + p_2 x_2 + \cdots + p_n x_n}{N}.$$

Since by chance, every group will increase its score by 1, except the
$p_n$ group, $\bar{x}'$, the obtained mean, is

$$\bar{x}' = \frac{p_0(x_0 + 1) + p_1(x_1 + 1) + \cdots + p_{n-1}(x_{n-1} + 1) + p_n x_n}{N}
= \frac{(p_0 x_0 + p_1 x_1 + \cdots + p_n x_n) + (p_0 + p_1 + \cdots + p_{n-1})}{N}$$

and $\bar{x}' = \bar{x} + 1 - \frac{p_n}{N}$. Thus the obtained mean is $1 - \frac{p_n}{N}$ units
larger than the true mean. It is characteristic of good tests that they
permit few, if any, of the subjects to attain perfect scores. We may
therefore regard $p_n = 0$ for most good tests. Hence $\bar{x}' = \bar{x} + 1$. In
order to derive the expression for the standard deviation of the obtained 
scores of the total population in terms of the true scores, we apply the 
well known equation for the standard deviation of the total population 
in terms of the means, sigmas and populations of the subgroups: 


$$N\sigma^2 = N_1(\sigma_1^2 + d_1^2) + N_2(\sigma_2^2 + d_2^2) + \cdots + N_n(\sigma_n^2 + d_n^2),$$

where $\sigma_1$ = sigma of subgroup 1 and $d_1$ is the deviation of the mean of
the subgroup from the mean of the total population. Now in our
problem, $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_n^2 = 1$, since these are the standard
deviations due to chance in each of the subgroups. As for $\sum N_i d_i^2$, it is
equal to $N\sigma_t^2$, since the true scores of each of the subgroups differ from
the obtained scores only by a constant, and hence the sigma is not
affected. For our problem, this expression reduces to $\sigma_o^2 = \sigma_t^2 + 1$
(where $\sigma_o$ = standard deviation of obtained scores and $\sigma_t$ = standard
deviation of true scores) or $\sigma_t = \sqrt{\sigma_o^2 - 1}$. This represents the
relationship between the true and obtained mean and sigma in one 
subtest. In order to obtain the relationship between the true and 
obtained mean and sigma in the total test, we proceed as follows: 
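
The single-subtest relationship can be checked exactly on a small made-up population. The sketch below is illustrative only: the subgroup sizes in `known_counts` are hypothetical, and every subgroup is assumed to leave at least two items to chance, since a single unknown item is fixed by elimination and adds no chance variance.

```python
from fractions import Fraction
from itertools import permutations


def chance_distribution(r):
    """Exact distribution of extra correct matches when r items are matched at random."""
    perms = list(permutations(range(r)))
    dist = {}
    for p in perms:
        hits = sum(1 for i, a in enumerate(p) if i == a)
        dist[hits] = dist.get(hits, 0) + Fraction(1, len(perms))
    return dist


def moments(dist):
    """Mean and variance of a {score: probability} distribution."""
    mean = sum(x * w for x, w in dist.items())
    var = sum(w * (x - mean) ** 2 for x, w in dist.items())
    return mean, var


n = 6  # items in the subtest
known_counts = {0: 10, 1: 25, 2: 40, 3: 20, 4: 5}  # hypothetical: items known -> persons
N = sum(known_counts.values())

true_dist = {known: Fraction(c, N) for known, c in known_counts.items()}
obtained_dist = {}
for known, c in known_counts.items():
    for extra, prob in chance_distribution(n - known).items():
        score = known + extra
        obtained_dist[score] = obtained_dist.get(score, 0) + Fraction(c, N) * prob

true_mean, true_var = moments(true_dist)
obt_mean, obt_var = moments(obtained_dist)
print(obt_mean - true_mean, obt_var - true_var)  # both differences should equal 1
```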

If there are m subtests or matching problems, the obtained mean of
the entire group on the entire test will be m units greater than the true
mean. Hence, to correct for chance, m units should be subtracted.
The above analysis will hold true only if the number of cases is suffi- 
ciently large or, if the number of individual subtests is sufficiently 
large. For the relationship between the true and obtained sigmas, 
we proceed as follows: 

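$$\sigma_o^2 = \sigma_{o1}^2 + \sigma_{o2}^2 + 2 r_{o1,o2}\,\sigma_{o1}\sigma_{o2}$$

when two subtests are considered, where $\sigma_{o1}$ represents the sigma of
the obtained scores in the first subtest, $\sigma_{o2}$ that in the second
subtest, and $\sigma_o$ the sigma of the whole test.

Now $r_{o1,o2}\,\sigma_{o1}\sigma_{o2} = \sum x_{o1}x_{o2}/N = \sum x_{t1}x_{t2}/N = r_{t1,t2}\,\sigma_{t1}\sigma_{t2}$, for the
sum of the cross products of the true scores is the same as the sum of
the cross products of the obtained scores, since the sum of the cross
products of the credited elements due to chance is approximately zero.
Hence $r_{o1,o2}\,\sigma_{o1}\sigma_{o2} = r_{t1,t2}\,\sigma_{t1}\sigma_{t2}$ and

$$\sigma_o^2 = \sigma_{t1}^2 + 1 + \sigma_{t2}^2 + 1 + 2 r_{t1,t2}\,\sigma_{t1}\sigma_{t2} = \sigma_t^2 + 2,$$

and for m subtests, $\sigma_o^2 = \sigma_t^2 + m$, or $\sigma_t = \sqrt{\sigma_o^2 - m}$.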
An estimated ‘‘true score” of a single individual could be similarly 
obtained by deducting a unit for each subtest from the total score on 
all the subtests. For this purpose, however, the number of subtests
should be very large. Since the influence of chance is independent of
the number of items in the subtest, it would be advisable to use
subtests with a small number of items in order to increase the number of
subtests and thus allow for the application of the correction for chance.
Thus, subtests consisting of four or five items are preferable to
subtests of a larger number of items. This preferability is enhanced by
the fact that subtests consisting of a larger number of items are not so 
easily handled and are quite disturbing in complexity. The number 
of items should not, however, be reduced too low, for then the oppor- 
tunities for scoring by means of the process of elimination are 
increased. 
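
The corrections derived above amount to two short formulas. The sketch below only illustrates how they might be applied; the function names and the numerical values are hypothetical and are not taken from the paper.

```python
from math import sqrt


def corrected_mean(obtained_mean, m):
    """True mean estimated as the obtained mean minus m, the number of subtests."""
    return obtained_mean - m


def corrected_sigma(obtained_sigma, m):
    """True sigma estimated as sqrt(obtained sigma squared minus m)."""
    variance = obtained_sigma ** 2 - m
    if variance < 0:
        raise ValueError("obtained variance is smaller than the number of subtests")
    return sqrt(variance)


# Hypothetical test of 8 matching subtests: obtained mean 31.5, obtained sigma 6.0.
print(corrected_mean(31.5, 8))
print(corrected_sigma(6.0, 8))
```

The same subtraction of m applies to an individual's total score, subject to the paper's condition that the number of subtests be very large.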

It should be noted that the above correction formulae are valid 
only when every individual attempts every item. For only under 
such conditions will the full effect of chance be realized. In order to 
obtain such a condition, the subjects should be told not to omit any 
items and to guess when they are in doubt.¹





CONTINGENCY BETWEEN THE ITEMS AND THEIR APPOSITES 


If the individual items of the test have no intrinsic relationship 
to their apposites, as is the case in the matching of nonsense syllables, 
the number of individuals matching item a with apposite 1 should not 
differ significantly from the number of individuals who match item a 
with apposite 2, or with any other one of the apposites. The degree 
to which the individual items are associated with their ‘‘true’’ apposites 
is a measure of the intrinsic relationship between the item and its 
apposite. The presence of this association may be determined by 
means of the χ² method as revised by Fisher.²

The matching test presents a specialized contingency table, one
in which the marginal frequencies are equal. Thus, for n items and a
population N,

$$\chi^2 = \sum \frac{\left(f - \frac{N}{z}\right)^2}{\frac{N}{z}} = \frac{z\sum f^2}{N} - Nz,$$

where f = frequency in a cell and z = number of cells in a row (or a column).

1 The author is grateful to Dr. J. W. Dunlap for calling this point to his
attention.

2 See Fisher: "Statistical Methods for Research Workers." 1930 edition,
pp. 82-84.

Hence, in order to determine χ² all that is needed is the sum of the
squares of the frequencies in the cells of the table. The presence of an
association between the items and their apposites can be determined
by Fisher's method, entering his table of P with n = (z - 1)².
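
As a check on the shortened computation, the sketch below compares it with the ordinary cell-by-cell χ² on a made-up 3 × 3 table. It assumes the reading of the formula in which every expected cell frequency is N/z; the function names and the table are hypothetical.

```python
def chi_square_matching(table):
    """Shortcut chi-square for a z x z matching table whose margins all equal N."""
    z = len(table)
    N = sum(table[0])
    rows_ok = all(sum(row) == N for row in table)
    cols_ok = all(sum(col) == N for col in zip(*table))
    if not (rows_ok and cols_ok):
        raise ValueError("every row and column must sum to N")
    sum_sq = sum(f * f for row in table for f in row)
    chi2 = z * sum_sq / N - N * z
    degrees_of_freedom = (z - 1) ** 2
    return chi2, degrees_of_freedom


def chi_square_direct(table):
    """Ordinary chi-square with expected frequency N / z in every cell."""
    z = len(table)
    expected = sum(table[0]) / z
    return sum((f - expected) ** 2 / expected for row in table for f in row)


# Hypothetical table: rows are items a, b, c; columns are apposites 1, 2, 3; N = 30.
table = [
    [20, 6, 4],
    [5, 19, 6],
    [5, 5, 20],
]
print(chi_square_matching(table), chi_square_direct(table))
```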


SUMMARY 


The influence of chance in matching tests was investigated. The 
relationship between the true and obtained means and sigmas is as 
follows: 

$$M_t = M_o - m$$
$$\sigma_t = \sqrt{\sigma_o^2 - m}$$

where t represents true score and o represents obtained score and m
is the number of subtests. If the number of subtests is sufficiently
large, an individual's score may be corrected similarly by subtracting
m, the number of subtests, from the obtained score. A shorter method
for obtaining χ² for the items of a matching test was also given.
ERRATA

Equation on page 680 should read:

$$\chi^2 = \sum \frac{\left(f - \frac{N}{z}\right)^2}{\frac{N}{z}} = \frac{z\sum f^2}{N} - Nz,$$

where z = number of cells in a row (or a column). EDUCATIONAL PSYCHOLOGY, December issue.





GRAPHICAL DETERMINATION OF PROBABLE ERROR
IN VALIDATION OF TEST ITEMS


DAVID F. VOTAW 
Southwest Texas Teachers College 


One phase of the work of validating tests consists of determining 
the selectivity of individual items of the proposed test. A test item 
which is answered correctly by fewer good students than poor students 
will need to be rejected because of adverse selectivity. On the other 
hand, even though a test item be answered correctly by more good 
students than poor students there is always some probability that the 
difference was due to pure chance. 

It therefore becomes the task of the test validator to determine the 
proportions of the upper and lower groups answering a given item 
correctly. He will then need to decide on some degree of probability, 
somewhere short of absolute certainty, that the difference between the 
means of the two proportions was not due to chance. 

Some test makers are satisfied if this difference is as much as two 
times the probable error of the difference, while others will not accept 
the item unless the difference is at least five times the probable error. 
A difference of three times the probable error insures a probability of 
about twenty to one that the difference was not due to chance. 

This paper is not concerned with the merits of various criteria for 
selecting the upper and lower groups of students. The statement 
should be made incidentally, however, that it was revealed by Jensen,¹
who in turn gave credit for the development of the proof to Dr. Truman 
L. Kelley, that the size of the upper and lower categories should be 
each twenty-seven per cent of the total number of students to produce a 
maximum ratio between the difference of their means and the probable 
error of the difference. However, the technique proposed herein is 
suited as well to other percentages which might be chosen as it is to 
twenty-seven per cent. 

1 Jensen, Milton B.: "Trait Differences Between Three Groups in Education."
(Unpublished Doctor's Thesis, Stanford University, 1927), pp. 26-27.

To continue with the development:
Let $p_1$ = proportion of upper group answering item correctly, and
    $p_2$ = proportion of lower group answering item correctly.

Then the ordinary procedure for determining the acceptance of the
item is as follows:¹

$$PE_{p_1} = \frac{.6745}{\sqrt{N}}\sqrt{p_1 q_1} \tag{1}$$

$$PE_{p_2} = \frac{.6745}{\sqrt{N}}\sqrt{p_2 q_2} \tag{2}$$

Therefore

$$PE_{(p_1 - p_2)} = \sqrt{\left(\frac{.6745}{\sqrt{N}}\sqrt{p_1 q_1}\right)^2 + \left(\frac{.6745}{\sqrt{N}}\sqrt{p_2 q_2}\right)^2} \tag{3}$$

Ordinarily the above three computations will need to be made for
each item concerned. The principal object, however, is not to find the
actual number of PE's between the means of the two proportions but
merely to reject items with dangerously high probability that they are
not selective of good students. Therefore, it becomes desirable merely
to keep the difference between the means of the two proportions equal
to or greater than some constant (k) times the PE of their difference.
As only critical points need to be considered here the condition stated
above may be written:

$$p_1 - p_2 = k\sqrt{\left(\frac{.6745}{\sqrt{N}}\sqrt{p_1 q_1}\right)^2 + \left(\frac{.6745}{\sqrt{N}}\sqrt{p_2 q_2}\right)^2} \tag{4}$$

$$= \frac{.6745k}{\sqrt{N}}\sqrt{p_1 q_1 + p_2 q_2} \tag{5}$$

But $q_1 = (1 - p_1)$ and $q_2 = (1 - p_2)$.

Substituting these values in (5) it becomes:

$$p_1 - p_2 = \frac{.6745k}{\sqrt{N}}\sqrt{p_1 - p_1^2 + p_2 - p_2^2}. \tag{6}$$

For convenience in computation set:

$$\frac{.6745k}{\sqrt{N}} = h.$$

1 See Karl J. Holzinger: "Statistical Methods for Students in Education."
Ginn and Company, 1928, pp. 235-237.
Then

$$p_1 - p_2 = h\sqrt{p_1 - p_1^2 + p_2 - p_2^2}. \tag{7}$$

Squaring:

$$p_1^2 - 2p_1 p_2 + p_2^2 = h^2 p_1 - h^2 p_1^2 + h^2 p_2 - h^2 p_2^2.$$

Collect to solve for $p_1$:

$$(1 + h^2)p_1^2 + (-2p_2 - h^2)p_1 + (p_2^2 + h^2 p_2^2 - h^2 p_2) = 0$$

and

$$p_1 = \frac{2p_2 + h^2 + \sqrt{(8h^2 + 4h^4)(p_2 - p_2^2) + h^4}}{2(1 + h^2)}. \tag{8}$$

Formula (8) being general may be used for any k and N.
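
Formula (8) lends itself to a one-line function. The sketch below is illustrative only and the name `critical_p1` is hypothetical. It returns the critical upper-group proportion for a given lower-group proportion, criterion k and group size N, and substitutes the result back into equation (7) as a check.

```python
from math import sqrt


def critical_p1(p2, k, n):
    """Upper-group proportion lying exactly k probable errors above p2, by formula (8)."""
    h2 = (0.6745 * k) ** 2 / n  # h squared, with h = .6745 k / sqrt(N)
    h4 = h2 * h2
    root = sqrt((8 * h2 + 4 * h4) * (p2 - p2 * p2) + h4)
    return (2 * p2 + h2 + root) / (2 * (1 + h2))


# Check against equation (7): p1 - p2 should equal h * sqrt(p1 - p1^2 + p2 - p2^2).
k, n = 3, 46
h = 0.6745 * k / sqrt(n)
for p2 in (0.0, 0.1, 0.3, 0.5, 0.8):
    p1 = critical_p1(p2, k, n)
    residual = (p1 - p2) - h * sqrt(p1 - p1 * p1 + p2 - p2 * p2)
    print(p2, round(p1, 4), abs(residual) < 1e-12)
```

For lower-group proportions very close to 1 the formula can return a value above 1, which simply means that no attainable upper-group proportion meets the criterion.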

Example of application: 

A test of two hundred sixty items which has been given a pre- 
liminary administration to one hundred seventy students is being 
validated. The responses of the highest forty-six (twenty-seven per 
cent) have been separated from the responses of the lowest forty-six 
students. Therefore N is forty-six. 

The ordinary procedure would involve three times two hundred
sixty computations. If the mean proportion of the upper group who
responded correctly on any given item is three PE or more above the
mean of the proportion of the lower group who responded correctly,
the difference will be regarded as significant and the item will be
retained. Then k is three.

$$h = \frac{.6745k}{\sqrt{N}} = \frac{.6745 \times 3}{\sqrt{46}} = .2982$$

$$h^2 = .088923$$

$$h^4 = .007907$$

Substituting these values for $h^2$ and $h^4$ in (8):

$$p_1 = \frac{2p_2 + .088923 + \sqrt{.743012(p_2 - p_2^2) + .007907}}{2.177846} \tag{9}$$

As many $p_1$ values as desired may now be found to correspond to
successive $p_2$ values substituted in (9). The data may be tabulated
or, better still, expressed graphically for reference during the validation
of the test, thus saving the labor of a large number of computations.

Ten to eighteen points on the graph are sufficient, perhaps, when N is
fifty or less.
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The table on page 686 illustrates the use of the formula by its 
application to the problem at hand. 

The accompanying graph, which was made from the data of the preceding
table, may be entered from the left side with either the proportion or
the number of correct responses of the lower group.

Example. Fifteen pupils of the lower group have answered an item
correctly. Enter the graph from the side at point fifteen. Follow
horizontally to the intersection with the curve. Follow the vertical line
at the intersection to the number at the top, which is found to be
twenty-four, the least number of pupils of the upper group who must
answer the item correctly to leave practically no doubt of its selectivity.

It is obvious, of course, that the process may be extended to the
construction of a system of graphs upon the $p_1$ and $p_2$ axes for a series
of N’s.
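
A system of reference tables for a series of N’s might be produced along the following lines. This is a sketch only: the function names, the particular group sizes, and the use of the nearest whole number (following the footnote to the table for N = 46) are assumptions. Here h is carried without intermediate rounding, so a proportion may differ in its later decimal places from one computed by hand.

```python
import math


def critical_p1(p2: float, k: float, n: int) -> float:
    """Formula (8), with h = .6745 k / sqrt(N) computed internally."""
    h2 = (0.6745 * k) ** 2 / n
    h4 = h2 * h2
    root = math.sqrt((8 * h2 + 4 * h4) * (p2 - p2 * p2) + h4)
    return (2 * p2 + h2 + root) / (2 * (1 + h2))


def lookup_table(n: int, k: float = 3.0) -> dict:
    """Map each lower-group count to the nearest whole upper-group count."""
    table = {}
    for lower in range(n + 1):
        p1 = critical_p1(lower / n, k, n)
        # Nearest whole number of pupils, never more than the group size.
        table[lower] = min(n, math.floor(n * p1 + 0.5))
    return table


if __name__ == "__main__":
    for n in (30, 46, 60):  # illustrative group sizes
        table = lookup_table(n)
        print(n, [(lower, table[lower]) for lower in range(0, n + 1, 5)])
```

Each dictionary plays the part of one curve in the system of graphs: it is entered with the number of correct responses in the lower group and read for the corresponding number in the upper group.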
PROPORTIONS AND NUMBERS OF UPPER GROUP CORRESPONDING TO GIVEN
PROPORTIONS OF LOWER GROUP
(N = 46)

    Lower group          Upper group
    No.    $p_2$         No.¹   $p_1$
     0     .0000           4    .0817
     1     .0217           6    .1314
     3     .0652          10    .2065
     5     .1087          13    .2702
     7     .1522          15    .3285
    10     .2174          19    .4042
    15     .3261          24    .5303
    20     .4348          29    .6405
    23     .5000          32    .7020
    25     .5435          34    .7412
    30     .6522          38    .8326
    35     .7609          42    .9182
    38     .8261          44    .9549
    40     .8696          45    .9788
    41     .8913          46   1.0000

¹ Numbers in this column are the whole numbers nearest to the computed
values of $p_1$.

THE RELATIVE EFFECT OF THREE ORDERS OF
ARRANGEMENT OF ITEMS UPON PUPILS’ SCORES
IN CERTAIN ARITHMETIC AND SPELLING
TESTS¹

VIRGINIA LEE CAPRON

University of Minnesota

In the report of an experiment on the learning curve² the point
was made that the tests used were the usual type of scale arranged 
with items in order from easy to hard. Discussion brought out the 
fact that there is a general assumption that such an arrangement is 
the only valid one. If arrangement does have a statistically impor- 
tant bearing upon scores, the results of a test in which there were items 
of increasing difficulty arranged in the ‘‘accepted order’’ would not be 
comparable directly to those in which order had been disregarded. 

In the literature of tests and measurements there is little evidence 
concerning the why of the easy-to-hard arrangement in power tests. 
Paterson³ and Ruch⁴ subscribe to the “shock absorber” theory of
arrangement; in regard to the arrangement of items after the initial 
motivation, Ruch states that easy-to-hard order increases both the 
validity and the reliability of the test and urges teachers to take care 
to construct a series of items of gradually increasing difficulty. In 
fact, almost all authors of books on constructing objective examina- 
tions and power tests, when they mention order at all, advise easy-to- 
hard, but no data substantiating the advice are given. 

It was, then, to arrive at some experimental evidence concerning
the relative effect of arrangement of items on the scores obtained that
this study was undertaken. Specifically, the problem was to discover
the relative effect of arrangement in easy-to-hard, hard-to-easy, and
random order in certain educational measurements involving arithmetic
problems, spelling, and fundamental processes in arithmetic.





¹ Summary of unpublished master’s thesis of the same title worked out under
the joint direction of Dr. L. J. Brueckner and Dr. M. J. Van Wagenen of the
University of Minnesota, 1932.

² Harbo, Rolf T.: “Growth in Silent Reading.” Unpublished master’s thesis,
University of Minnesota, 1928.

³ Paterson, Donald G.: “Preparation and Use of New Type Examinations.”
World Book Company, Yonkers-on-Hudson, New York, 1926, pp. 45-46.

⁴ Ruch, Giles M.: “The Objective or New Type Examination.” Scott, Foresman
and Company, New York, 1929, pp. 32-35.
What, if any, is the change in difficulty of the test as a whole and of
each item caused by a change in the order of arrangement of the 
items? Is the effect of such a change most pronounced in the cases of 
problems, spelling, or arithmetic processes? What is the relative 
effect upon superior, average, and dull pupils? Is mental age a factor? 
Is chronological age a factor? Is there a difference in this regard 
between boys and girls? 

The method used in the study was experimental and statistical. 
The subjects were pupils in Grade 5A in six schools and in Grade 8B 
in five schools in the same section of Minneapolis. These two classes 
were chosen in order to obtain as wide as possible age limits within the 
classes in which reasonable skill had been developed in the subjects 
tested. In each of these grades three groups were equated as nearly 
as possible on the basis of chronological age, mental maturity, intelli- 
gence quotient, and general social status. The instrument used to 
determine mental maturity and intelligence quotient was the Otis 
Self-Administering Test of Mental Ability, Intermediate Examina- 
tion, Form B. This test is applicable to both fifth and eighth grade; 
so the scores obtained are directly comparable. A total of four 
hundred fifty-three pupils took part in the experiment. The following 
scales were used: 


Minneapolis Problems in Arithmetic Scale.
    Grade 5A: Scale R, Division 1 (twenty-five items).
    Grade 8B: Scale R, Division 2 (twenty-five items).
Unit Attainment Scales, Spelling Scale B.
    Grade 5A: First thirty-eight words.
    Grade 8B: Entire scale of seventy words.
Brueckner-Van Wagenen Scale for Fundamental Processes in Arithmetic.
    Grade 5A: Division 2, Form A (thirty-two items).
    Grade 8B: Division 3, Form A (thirty-two items).


The forms chosen in each case contain items which are easy and 
progress through a few which are extremely difficult for the pupils in 
the grade to which they are administered. Each test was arranged in 
three orders: easy-to-hard, in which the easiest item was first, followed 
by items of gradually increasing difficulty; hard-to-easy, in which the 
most difficult items occurred first, followed by gradually easier items 
until the easiest was found as the last; and random order. Obviously 
there might be any number of random orders. It was found that the 
most truly random order could be obtained by drawing the numbers 
of the items from a hat. The problem and fundamentals tests were 
mimeographed in each of the three orders; the spelling words were
dictated by the examiner. Special emphasis was placed on the 
direction that each child should attack the items on all tests in the 
order given. 

The order of administering the tests was rotated very carefully in 
order to equalize the factor of practice effect and to eliminate the possi- 
bility that changes in score were due to the order of taking the tests 
rather than to the order of arrangement of the items within the test. 
The plan was to give in each group within each grade first a test in 
problems, then spelling, and third arithmetic fundamentals. Such a 
series of three tests might be called a cycle. This cycle had to be 
repeated three times in each case, giving a total of nine tests, in addi- 
tion to the intelligence test, for each child. As shown in Table I the 
three tests within a cycle were each in a different order of arrangement, 
and each group followed a different but regular schedule in taking the 
tests. The pupils, of course, did not know that the tests would be 
presented three times. The tests were given in cycles rather than 
repeating three spelling or problem tests in a series in order that 
methods of attack would not be facilitated. The pupils were allowed 
ten minutes longer than the standardized time for the problems and 
fundamentals tests; this allowed all except a few laggards to finish 
any of the arrangements. Papers were eliminated in which all three 
orders in any one subject had not been completed either because of 
absence or lack of time. The spelling tests were dictated with 
the time controlled by stopwatch; the same amount of time was 
allowed for each particular word regardless of the order in which 
it appeared. 


TABLE I. SCHEDULE OF ROTATING GROUPS

Cycle   Group   Problems        Spelling        Fundamentals
I       A       Easy-to-hard    Hard-to-easy    Random
        B       Hard-to-easy    Random          Easy-to-hard
        C       Random          Easy-to-hard    Hard-to-easy
II      A       Hard-to-easy    Random          Easy-to-hard
        B       Random          Easy-to-hard    Hard-to-easy
        C       Easy-to-hard    Hard-to-easy    Random
III     A       Random          Easy-to-hard    Hard-to-easy
        B       Easy-to-hard    Hard-to-easy    Random
        C       Hard-to-easy    Random          Easy-to-hard
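
The rotation in Table I follows a regular pattern: passing to the next group, the next subject, or the next cycle advances the order of arrangement by one step. The sketch below regenerates the schedule from that rule. The rule is read off the table and is not stated by the author; the names `ORDERS`, `order_for`, and `schedule` are illustrative.

```python
ORDERS = ["Easy-to-hard", "Hard-to-easy", "Random"]
SUBJECTS = ["Problems", "Spelling", "Fundamentals"]
GROUPS = ["A", "B", "C"]
CYCLES = ["I", "II", "III"]


def order_for(cycle: int, group: int, subject: int) -> str:
    """Order of arrangement for zero-based cycle, group, and subject indices."""
    return ORDERS[(cycle + group + subject) % len(ORDERS)]


def schedule() -> dict:
    """Return {(cycle, group): [order for each subject]} in the layout of Table I."""
    return {
        (CYCLES[c], GROUPS[g]): [order_for(c, g, s) for s in range(len(SUBJECTS))]
        for c in range(len(CYCLES))
        for g in range(len(GROUPS))
    }


if __name__ == "__main__":
    for (cycle, group), orders in schedule().items():
        print(cycle, group, *orders, sep="\t")
```

Under this rule every group meets every order exactly once in each subject, and within any one cycle the three tests a group takes are each in a different order.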
One picture of the effect of order of arrangement is presented in a
comparison of the number of errors on each item when in easy-to-hard
order with the number of errors when in hard-to-easy and random
orders. Such a study revealed very slight differences in percentage of
error in the great majority of the items, and where notable differences
did occur, these variations tended to be almost random. The average
percentages of error, including omissions, for each type of material and
for the three orders of arrangement are shown in Table II. Throughout
this report the following notation has been employed:


I: Easy-to-hard 
II: Hard-to-easy 
III: Random 


TABLE II. AVERAGE PERCENTAGE OF ERROR

                        Order of items
                      I       II      III
Problems
    Grade 5A        39.4    37.9    36.3
    Grade 8B        49.1    48.4    47.9
Spelling
    Grade 5A        29.7    31.5    32.3
    Grade 8B        36.5    36.3    36.5
Processes
    Grade 5A        74.3    74.1    74.8
    Grade 8B        60.1    62.3    68.9


From Table II it can be seen that there is no consistent tendency 
for all tests. For problems the random order of arrangement resulted 
in the smallest percentage of error in both 5A and 8B; the hard-to-easy 
order also resulted in a smaller percentage of error than the easy-to- 
hard order. Exactly the reverse is true for 5A spelling and 8B arith- 
metic processes tests. In the 8B spelling and 5A arithmetic processes 
tests there is practically no difference in the percentages of error for 
the three orders of arrangement. In other words, three different sets 
of conclusions, mutually contradictory, are deduced. It seems 
obvious that the question of the effect of order of arrangement presents 
a series of involved relations that should be intensively studied by those 
interested in the construction of tests. For the present one may con- 
clude that the order of arrangement may not be a decisive factor. 
A second method for discovering the effect of order of arrangement
on pupils’ scores is to determine the differences between mean changes 
in score on groupings of any two arrangements of items. If the mean 
change in score should prove large, one might conclude that the differ- 
ence was due, at least in part, to the change in the order of presentation 
of the items. 

All tests were marked on a raw score basis. Instead of using the 
scores themselves the plan used to measure the effect of arrangement 
has been to find the differences in scores on each paper on any two forms 
of a test and to use this figure as a basis for the statistical treatment. 
Thus a pupil with a score of ten problems correct on order I and six on 
order II will have a “difference score” of four, the same as a pupil with 
a score of fifteen correct on order I and eleven on order II. The meaning
of a difference in mean score of four between order I and order II
would be that there was an average of four more problems solved cor- 
rectly when the items were arranged in easy-to-hard order than when 
arranged in hard-to-easy order. When “difference scores” are used, 
the results of the tests in several schools in which pupils are not equally 
advanced may be treated together with less danger of misleading con- 
clusions since the differences are statistically comparable. Table III 
indicates in summary fashion the results obtained when dealing with 
Grade 5A as a whole and with Grade 8B as a whole. The differences
will be seen to be slight when one realizes that the scale contained 
twenty-five items for problems, thirty-eight items for Grade 5A or 
seventy for Grade 8B for spelling, and thirty-two items for arithmetic 
fundamentals. 
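
The “difference score” treatment described above can be illustrated with a short sketch. The four pupils’ scores are hypothetical (the first two are the pupils of the example in the text), the function name is illustrative, and the use of the population standard deviation as the measure of scatter is an assumption.

```python
import statistics


def difference_scores(first_order, second_order):
    """Per-pupil difference: score on the first order minus score on the second."""
    return [a - b for a, b in zip(first_order, second_order)]


# Hypothetical raw scores for four pupils; the first two are the pupils
# of the example in the text (ten vs. six and fifteen vs. eleven correct).
order_i = [10, 15, 8, 20]
order_ii = [6, 11, 9, 19]

diffs = difference_scores(order_i, order_ii)
mean_diff = statistics.mean(diffs)   # positive when order I gives higher scores
scatter = statistics.pstdev(diffs)   # population S.D. (an assumption)

print(diffs, mean_diff, round(scatter, 2))
```

Because only the within-pupil change enters the computation, two pupils of very different attainment contribute the same figure when their scores shift by the same amount, which is what allows schools of unequal advancement to be pooled.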

On the same table will be found the mean differences for chrono- 
logical age groups. The ten year old group includes all the children 
of from ten years and no months to those of ten years eleven and 
nine-tenths months. Other age groups were made up in the same 
manner. The difference scores were then treated from the point of 
view of mental maturity. The score on the Otis Test is in itself a 
measure of mental maturity. This score has in each case been used 
as a basis for placing an individual within a Mental Maturity Group. 
The following are the scores and the Binet equivalents of the groups 
shown: 


    0-23 ............ below 7-4 to, but not including, 10-0
    23-35 ........... 10-0 to 11-6
    35-51 ........... 11-6 to 14-0
    51-60 ........... 14-0 to 15-8
    60 and above .... 15-8 and above
TABLE III. DIFFERENCES IN CHANGES IN NUMBER OF ITEMS CORRECT

                                                               Arithmetic
                                Problems        Spelling      fundamentals
                                I-II   I-III    I-II   I-III    I-II   I-III
(a) By grade.
  Grade 5A
    N                            152     152     154     154     157     157
    Mean                       +0.08   -0.35   +0.74   +1.27   +0.21   +0.35
    S.D.                        2.61    2.75    3.64    3.44    2.24    2.80
  Grade 8B
    N                            213     213     201     201     221     221
    Mean                       +0.53   +0.12   +0.55   +0.71   +1.65   +1.51
    S.D.                        2.66    2.88    4.52    4.69    4.12    3.61
(b) By CA groups.
  Ten year olds.
    N                             61      61      57      57      59      59
    Mean                       +0.71   +0.52   +0.18   +1.34   -0.08   -0.08
    S.D.                        2.77    3.05    2.25    2.83    2.52    3.24
  Eleven year olds.
    N                             54      54      55      55      56      56
    Mean                       -0.78   -0.95   +0.64   +0.91   +0.16   +0.59
    S.D.                        3.05    2.65    2.64    2.78    2.47    2.59
  Twelve year olds.
    N                             70      70      69      69      79      79
    Mean                       -0.13   +0.21   +1.41   +0.89   +1.61   +1.35
    S.D.                        2.47    2.94    5.93    6.69    3.71    2.90
  Thirteen year olds.
    N                            114     114     110     110     118     118
    Mean                       +0.79   -0.13   +0.27   +1.07   +1.73   +1.52
    S.D.                        2.64    3.00    3.97    4.18    4.10    3.92
  Fourteen year olds.
    N                             66      66      64      64      66      66
    Mean                       +0.59   -0.14   +0.66   +0.52   +0.68   +1.37
    S.D.                        2.90    2.87    3.16    2.91    3.75    4.16
(c) By mental maturity groups.
  0-23.
    N                             15      15      16      16      17      17
    Mean                       -0.10   +0.10   +0.38   +0.94   +0.44   +1.32
    S.D.                        2.15    2.42    3.95    4.00    2.29    1.92
  23-35.
    N                             78      78      73      73      78      78
    Mean                       +0.54   -0.19   +1.38   +1.62   +0.96   +0.31
  51-60.
    N                             92      92      89      89      97      97
    Mean                       +0.29   -0.07   +0.93   +0.29   +1.66   +1.31
    S.D.                        2.80    3.23    3.73    4.38     4.5    3.93
  60 and above.
    N                             38      38      34      34      38      38
    Mean                       +0.32   -0.47   +0.00   -0.26   +1.82   +2.21
    S.D.                        2.80    2.69    2.93    3.33    3.94    3.74
(d) By IQ groups.
  58-80.
    N                             16      16      16      16      18      18
    Mean                       +0.75   +0.75   +2.81   +3.19   +1.67   +2.12
    S.D.                        2.23    2.60    8.75    8.55    2.54    1.88
  80-90.
    N                             33      33      37      37      38      38
    Mean                       +0.23   +0.05   +0.61   +0.93   +1.26   +1.58
    S.D.                        2.77    2.57    2.87    3.55    3.33    3.26
  90-110.
    N                            211     211     209     209     220     220
    Mean                       +0.46   -0.04   +0.67   +1.04   +0.68   +0.74
    S.D.                        2.81    2.89    4.10    4.21    3.51    3.56
  110-120.
    N                             85      85      76      76      85      85
    Mean                       +0.20    -0.3   -0.04   +0.15   +1.61   +1.54
    S.D.                        2.89    3.10    2.98    3.35    4.13    3.64
  120-137.
    N                             20      20      17      17      17      17
    Mean                        -0.6   -0.45   +1.15    +1.5   +0.97   +0.80
    S.D.                        2.53    2.16    2.27    2.91    3.13    4.06
(e) By sex groups.
  Boys (5A).

The pupils were further divided into groups on the basis of the intelligence
quotient. The lowest IQ was fifty-eight, the highest one
hundred thirty-six. Naturally the greatest number of pupils occurs
in the average group, ninety to one hundred ten IQ. The final grouping
shown in Table III is that on the basis of sex. Boys and girls
were divided and the results treated separately for the two grades.

As indicated by the results obtained, the following conclusions seem 
warranted in reference to this particular experiment: 

1. Order of arrangement of items has in general very little effect 
upon pupils’ scores in power (scaled) tests in problems, spelling, and 
arithmetic fundamentals. 

2. On the basis of percentage of error very few items reveal impor- 
tant differences in amount of error. Differences which do occur are 
random, that is, no one order is responsible for the greatest number of 
errors, for the maximum percentages of error are distributed about 
equally among the three orders of arrangement. 

3. On the basis of differences in changes in scores, problems, spell- 
ing, and arithmetic fundamentals are consistent in showing only slight 
variations. The general tendency is for order I to be slightly easier 
than order II or order III; such differences in favor of order I were 
statistically significant (difference in means as much as four times the 
probable error of the difference) only once, between the easy-to-hard 
and random orders for problems in Grade 8B. No other differences 
were statistically significant for any of the tests. When the “difference
scores” are handled on the basis of the chronological age or mental
maturity of the pupils, mean differences among groups are in most
cases slight. Where any especially notable differences among groups
occur, no definite tendency is discernible. When the intelligence
quotient determines the group into which a pupil falls, the small group 
of children comprising those with IQ less than eighty are most affected 
by a change from easy-to-hard order in problems, spelling, and funda- 
mentals. The few differences occurring among the four groups of 
higher degree of intelligence are in no general direction; most differ- 
ences among the four higher groups are unimportant. There are no 
sex differences in the effect of order of arrangement of items in spelling 
and arithmetic fundamentals; boys and girls in Grade 8B are affected 
in the same way by order in problems, but with 5A pupils there is 
shown a slight tendency for the hard-to-easy and random arrangements 
to be easier for the boys and harder for the girls than the easy-to-hard 
arrangement.
4. Although mean differences are small, there is considerable
scattering above and below the mean, indicating that individual 
pupils may be considerably influenced by the factor of order of items, 
but this influence is very different in the case of different pupils. The 
difference in effect in individual cases is manifest in any of the special 
groups, CA, MA, IQ, and sex, and from case studies of paired pupils. 


IMPLICATIONS 


The presence of such differences makes it especially questionable 
whether any one order of arrangement could be advocated with any 
degree of conviction. From these results it would not seem justifiable 
to require that teachers arrange in order of approximate difficulty 
items in informal classroom exercises and tests which the majority 
of the children would be expected to finish in the time allotted. Fur- 
thermore no apparent harm would result were problems of various 
degrees of difficulty included in an activity or life interest unit. 

The usual procedure is to urge further experimentation with ever 
larger numbers of pupils; on the contrary it is here suggested that 
further investigation to be most profitable might concentrate upon a 
small, carefully selected group of from forty to fifty pupils. With the 
small group very careful analyses of the individual papers would be 
possible, and the technique of personally interviewing the pupils to 
discover the fundamental causes for changes in scores would help to 
determine to what degree order of arrangement of items might be the 
factor causing change. 
THE ESSAY VERSUS THE OBJECTIVE EXAMINATION
AS MEASURES OF THE ACHIEVEMENT OF 
BI-LINGUAL CHILDREN 


FLOYD F. CALDWELL 
State Teachers College, Chico, California 


AND 


MARY DAVIS MOWRY 
New Mexico State Teachers College, Silver City, New Mexico 


In this study a comparison is made between (1) the scores earned 
on examinations of the essay-type and examinations of the objective- 
type by Spanish-American children, (2) the scores earned on the same 
tests by Anglo-American children, and, (3) the relative achievement 
of Spanish-American and Anglo-American children as it is measured 
by the two types of tests. One of the primary purposes of the study 
is to determine whether the use of new-type tests increases or decreases 
the language handicap of Spanish-speaking children.

In various studies where standardized tests have been employed, 
the Spanish-Americans have been found to test consistently lower in 
intelligence than English-speaking Americans. If lack of mental 
ability is the important contributing cause of the low standing of Span- 
ish-American children and language handicap is an insignificant 
influence, it seems reasonable to expect that the scores which they 
obtain on the objective and essay-type examinations would behave 
in much the same relative manner as those obtained by the Anglo- 
Americans who have little or no language handicap. If, on the contrary,
it is found that Spanish-American pupils rank relatively much
lower on the essay type than on the objective type tests, other things 
being as nearly equal as possible, it becomes apparent that difficulty of 
expression does operate to handicap these children. 

The term Spanish-American is used throughout this study to designate
all pupils who gave their nationality as “Mexican” or “Spanish.”
The term Anglo-American is used in the usual sense. No attempt has 
been made to limit the testing program to include Spanish-Americans 
from Spanish-speaking homes, nor in any way to determine the degree 
of blood. Some pupils stated that they were of mixed parentage. 
Doubtless, however, a number of children from Spanish mothers and 
Anglo fathers considered themselves Anglo-American. It was assumed
that for those who failed to record Spanish or Mexican blood the
language difficulty caused negligible differences in scores. If such
difficulty existed to any extent for such children, then the conclusions 
drawn from a study of the data presented would tend to minimize 
rather than exaggerate the true differences in achievement. 

In order not to interfere with the class-time organization, a number 
of short tests were constructed of both types. The test results of like 
types of examinations were then thrown together into the equivalent 
of one long test and the various measures computed. A total of 
six hundred twenty-three children were tested and the results from 
four thousand six hundred forty-six tests were used as a basis for the 
conclusions drawn. Pupils were tested in the fields of English and 
History. 

Each objective test was accompanied by a test of the essay type 
over the same material, and in each case it was possible for a pupil 
to score an equal number of points on the two. Since it was desirable 
that the pupils tested should respond to an equal number of items on 
the two kinds of tests, questions in the essay examinations were 
detailed. This reduced the amount of writing required of the pupil on 
the essay examination to less than the ordinary amount required on 
tests of this particular type. Here again if differences do exist in the 
scores obtained on the two types of tests used in this particular study, 
these differences should be smaller than would ordinarily be the case 
with the ordinary run of classroom essay tests. 

Corey’s¹ study indicates that when carefully constructed the essay
and the objective examination measure very nearly the same thing 
when little or no language handicap exists. Similarly, it was hoped 
that in this study the tests were so constructed that they measure 
equally well a knowledge of subject-matter and the degree to which 
the children were familiar with the textbooks in use. 

The tests were administered by the regular classroom teachers 
under normal classroom conditions. Each child was given a mimeo- 
graphed copy of the test and was told to follow directions carefully. 
Each teacher was supplied with a complete set of directions. The 
objective tests were given first. If an appreciable practice effect 
occurs, the essay test results would be the ones affected. Liberal
time allowance tended to minimize the speed factor. Both the objec- 
tive and essay tests were checked and graded twice, the essay tests 
being checked by a second grader. Precautions were taken so that the
graders did not know whether the papers being graded belonged to the
Spanish-American or to the Anglo-American children.

¹ Corey, Stephen M.: Correlation between New-type and Essay Examination
Scores. School and Society, Vol. XXXII, pp. 849-850.

In comparing the achievement of the two groups on the English 
tests it is found that the results substantiate the findings of other 
investigators in the field in that the Spanish-Americans test relatively 
lower in mixed groups. The difference is more marked, however, on 
the essay than on the objective tests. 

As is to be expected, some overlapping exists between the two 
groups. Analysis of the data reveals the fact that, on the average, 
twice as large a proportion of Spanish-American pupils appear above
the upper decile on the objective tests as on the essay tests, while the
opposite is true for the lowest decile. In like manner, many more of
these pupils fall above the median of the Anglo-American group on the 
new-type than on the old-type tests. The median score of Anglo- 
Americans exceeds that of the Spanish-Americans for both tests. This 
condition is more pronounced in grades four and five. Table I gives 
the amount of overlapping of scores obtained on the English tests. 


TABLE I. PERCENTAGE OVERLAPPING OF SCORES OF THE TWO GROUPS ON THE
ENGLISH TESTS

                                                                         Percentage that
                        Percentage of   Percentage of   Percentage of    median of Anglo-
                        Spanish-        Spanish-        Spanish-         American exceeds
                        American        American        American above   median of
                        above ninety    below ten       median of        Spanish-
Grade     Test          percentile      percentile      Anglo-American   American
III       Objective         8.11            8.11           51.35              2.7
          Essay             5.41           10.82           16.22              8.9
IV        Objective         5.22           10.00           45.00             28.5
          Essay             2.50           22.50           22.50             35.1
V         Objective         4.60           11.49           25.28             27.3
          Essay             4.60           13.61           20.69             49.0
VI        Objective         1.53            9.23           32.31             15.8
          Essay             3.07           13.85           20.00             27.3
VII       Objective         6.78            5.09           37.29             11.3
          Essay             1.62           16.95           37.29             20.6
VIII      Objective         3.33           10.00           23.30             22.9
          Essay             0.00           20.00           16.67             30.9
Average   Objective         4.93            8.99           32.98             18.1
          Essay             2.87           16.29           23.33             28.6

From the tabulated results in Table II it can be readily seen that
the differences which exist between scores earned by the Spanish-American
children and Anglo-American children are always more
marked on the essay tests. For instance, in the third grade the difference
in mean scores for the objective test is only 1.14 with a $PE_{\text{diff}}$ of
0.96, a critical ratio of 1.19, and seventy-nine chances in one hundred
that the difference is significant. On the essay tests the difference is
3.55, the $PE_{\text{diff}}$ 0.81, the critical ratio 4.38, and there are one hundred
chances in one hundred that there is a true difference greater than zero. For the
seventh grade the differences in means are 8.08 and 17.57, respectively.
However, here the probable errors of difference are considerably
greater so that the critical ratios are not quite as large as the differences
alone would indicate. Nevertheless, the difference between a
critical ratio of 1.80 and one of 3.80 is certainly marked. Approximately
the same conditions hold for all the grades concerned.
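
The critical ratios quoted above are simply the difference divided by its probable error. The article does not say how the “chances in one hundred” were obtained; the sketch below assumes the usual normal-curve conversion (one PE taken as .6745 of a standard deviation), which appears to agree with the third-grade figures quoted. The function names are illustrative.

```python
import math

PE_IN_SIGMA = 0.6745  # one probable error expressed in standard deviations


def critical_ratio(difference: float, pe_diff: float) -> float:
    """Difference between the two groups divided by its probable error."""
    return difference / pe_diff


def chances_in_100(ratio: float) -> int:
    """Chances in one hundred of a true difference greater than zero.

    Assumes a normal sampling distribution; the article does not state
    the method actually used.
    """
    z = ratio * PE_IN_SIGMA
    probability = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return round(100 * probability)


if __name__ == "__main__":
    # Third-grade mean differences: objective test, then essay test.
    for difference, pe_diff in ((1.14, 0.96), (3.55, 0.81)):
        ratio = critical_ratio(difference, pe_diff)
        print(round(ratio, 2), chances_in_100(ratio))
```

A ratio of four or more probable errors corresponds, on this assumption, to better than ninety-nine chances in one hundred, which is why such a ratio is commonly treated as practically certain.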


COMPARISON OF ACHIEVEMENT IN HISTORY 


Since no formal history was taught in the third and fourth grades 
of the schools concerned, there was little uniformity of subject-matter 
employed in instruction. Therefore tests were only given in grades 
five to eight inclusive. 

Results of the examinations in History show a resemblance to those 
on the English tests. That is, (1) the two groups still differ in achieve- 
ment as measured by both the objective and the essay type of tests; 
(2) least agreement in scores is again found in grades five and six; 
(3) Spanish-American scores sometimes increase on the essay tests, 
but never to the extent that the Anglo-American scores are raised; 
and, (4) chances are greater that a significant difference exists between 
mean and median scores of the two groups on essay than on objective 
tests. A notable difference lies in the fact that results on the History 
tests show the mean and median Anglo-American scores in History 
proportionately far above the mean and median Spanish-American 
scores, a condition that does not exist to such a marked degree in 
English. 

The median scores of the Anglo-American children on objective 
and essay tests in English exceed those of the Spanish-American 
children by 18.1 per cent and 28.6 per cent, respectively, while the 
median scores of the Anglo-Americans on the two types of tests in 
History exceed the scores of the Spanish-Americans by 53.5 per cent 
on objective tests and 75.7 per cent on essay tests. Obviously, the
Spanish-Americans tested are more adept in the use of our language
when taking the English tests than in taking the History tests. This is
probably due to the fact that the History tests require a transfer of
application into another field. This no doubt makes the language
handicap more pronounced, for when the child learns a new word in
one situation he may fail to recognize it when it appears in a new setting.
Thus relatively lower performance on History tests is probably
to be expected of the Spanish-American children who do not have as
yet sufficient control of English to study, read, and respond with justice
to their true intellectual capacity. Tables III and IV give the
various computed measures for the scores obtained on the History
tests.

TABLE II. MEAN AND MEDIAN DIFFERENCES, PROBABLE ERRORS OF THE
DIFFERENCES, THE CRITICAL RATIOS AND THE SIGNIFICANCE OF THE
DIFFERENCES BETWEEN SCORES EARNED ON ENGLISH TESTS BY
SPANISH-AMERICAN AND ANGLO-AMERICAN CHILDREN

                                                           Critical   Chances in
Grade   Measure   Test         Difference   PE (diff.)     ratio      one hundred
III     Mean      Objective       1.14         0.96         1.19          79
                  Essay           3.55         0.81         4.38         100
        Median    Objective       1.89         1.20         1.58          86
                  Essay           2.67         1.01         2.64          96
IV      Mean      Objective       7.68         3.38         2.27          94
                  Essay          12.90         3.91         3.30          99
        Median    Objective      12.50         4.24         2.95          98
                  Essay          14.61         4.95         2.95          98
V       Mean      Objective       5.76         1.05         5.49         100
                  Essay           7.15         1.19         5.92         100
        Median    Objective       8.25         1.32         6.25         100
                  Essay          12.03         1.49         8.07         100
VI      Mean      Objective       6.19         1.73         3.52          99
                  Essay           9.79         1.57         6.22         100
        Median    Objective       5.25         2.17         2.42          95
                  Essay           8.73         1.96         4.45         100
VII     Mean      Objective       8.08         4.50         1.80          89
                  Essay          17.57         4.62         3.80          99
        Median    Objective       7.25         5.63         1.29          81
                  Essay          14.44         5.78         2.50          95
VIII    Mean      Objective       8.82         2.53         3.49          99
                  Essay           9.99         2.08         4.80         100
        Median    Objective       6.90         3.17         2.18          93
                  Essay           9.50         2.61         3.64          99

TABLE III. PERCENTAGE OVERLAPPING OF SCORES OF THE TWO GROUPS ON THE
HISTORY TESTS

Column key:
  (a) Percentage of Spanish-American above ninety percentile
  (b) Percentage of Spanish-American below ten percentile
  (c) Percentage of Spanish-American above median of Anglo-American
  (d) Percentage that median of Anglo-American exceeds median of
      Spanish-American

Grade    Test         (a)     (b)     (c)     (d)
V        Objective    6.98   17.44   25.58    73.3
         Essay        2.33   18.60   20.93   119.1
VI       Objective    1.30    9.08   22.37   104.4
         Essay        1.30   18.42   21.52   126.8
VII      Objective   10.70   10.70   38.33    18.8
         Essay        5.08   16.95   33.90    27.3
VIII     Objective    6.67   13.34   33.33    17.4
         Essay        3.33   16.67    6.67    29.7
Average  Objective    6.41   12.64   29.90    53.5
         Essay        3.01   17.66   20.77    75.7




















TABLE IV. MEAN AND MEDIAN DIFFERENCES, PROBABLE ERRORS OF THE
DIFFERENCES, THE CRITICAL RATIOS AND THE SIGNIFICANCE OF THE
DIFFERENCES BETWEEN SCORES EARNED ON HISTORY TESTS BY
SPANISH-AMERICAN AND ANGLO-AMERICAN CHILDREN

Grade  Measure  Test       Difference  P.E. diff.  Critical  Chances in
                                                   ratio     one hundred
V      Mean     Objective    26.50       4.03       6.58        100
                Essay        42.47       3.77      11.26        100
       Median   Objective    40.33       5.04       8.00        100
                Essay        51.33       4.71      11.00        100
VI     Mean     Objective    20.60       3.28       6.28        100
                Essay        23.70       3.31       7.16        100
       Median   Objective    23.75       4.10       5.79        100
                Essay        27.90       4.14       6.74        100
VII    Mean     Objective     1.29       2.86       0.45         62
                Essay         9.54       2.63       3.78         99
       Median   Objective     6.00       3.58       1.68         87
                Essay         9.56       3.29       2.91         97
VIII   Mean     Objective     7.50       2.02       3.71        100
                Essay        11.20       1.78       6.29        100
       Median   Objective     6.00       2.52       2.34         94
                Essay         9.90       2.23       4.04        100
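
The last two columns of Tables II and IV can be checked roughly with the sketch below. The text does not state how they were computed, so the method is an assumption: the critical ratio is taken as the difference divided by its probable error, and the chances in one hundred are read from a normal curve after converting probable-error units to standard-deviation units (one P.E. is about 0.6745 sigma). The function names are illustrative only.

```python
import math

# Assumed convention: one probable error is about 0.6745 standard deviations.
PE_PER_SIGMA = 0.6745


def critical_ratio(difference, probable_error):
    """Observed difference expressed in units of its probable error."""
    return difference / probable_error


def chances_in_100(ratio):
    """Chances in 100 that the true difference exceeds zero (normal model)."""
    z = ratio * PE_PER_SIGMA                          # P.E. units -> sigma units
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return round(100 * phi)


# Grade VII, mean, objective History test (Table IV): difference 1.29, P.E. 2.86
ratio = critical_ratio(1.29, 2.86)
print(round(ratio, 2), chances_in_100(ratio))
```

Under these assumptions the Grade VII row comes out at a ratio of 0.45 and 62 chances in one hundred, in agreement with the printed entry.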
CONCLUSIONS AND IMPLICATIONS


These analyses of the data indicate (1) that language difficulty
operates to penalize the Spanish-American pupils when either the
objective or the essay type of examination is used as an instrument
for the measurement of achievement; (2) that there is considerably more
of a handicap experienced with the essay than with the objective type.
This is, in all probability, due to the fact that the essay type of
examination demands a “recall” of vocabulary, whereas the objective
examination requires largely a “recognition” of unfamiliar words;
(3) that there is a greater handicap experienced by the Spanish-American
children when tests are given in History than when they are given in
English. This may be due to the fact that the application of English
in a situation other than in the field of English might tend to increase
the language difficulty. In other words, responding in the field of
History requires a more definite transfer of language information and
application. When there are such large differences existing between
the results obtained on the objective and the essay types of
examination and when evidence points to the fact that this difference is largely
due to language handicap, there seems to be a strong probability that
the same factor may account for the relatively low standing of
Spanish-speaking children on the so-called intelligence tests.

A suggestion here might be advisable also regarding another possible
implication. The fact has been quite firmly established that more
learning occurs when practice is attended by satisfaction. It seems quite
reasonable to assume that when the Spanish-speaking child knows that 
his scores are very inferior to those of other children he is likely to 
become discouraged and learning will thereby be retarded. This 
study has indicated that the Spanish-speaking child tests relatively 
considerably higher on the objective tests than on the essay. Under 
these circumstances it seems that the teacher should employ the objec- 
tive type of test frequently to insure for this language handicapped 
child some measure of satisfaction in accomplishment. Thus she will 
be employing a more satisfactory and accurate measuring instrument, 
and be approaching more nearly a true score of accomplishment. At 
the same time learning will be facilitated and attitudes of inferiority, 
sullenness and other undesirable mental products will be far less apt 
to develop. 





THE MEANING OF AN AVERAGE 









































WARD H. TAYLOR
Employment Stabilization Research Institute, University of Minnesota

I. THE CLASSICAL MEANS

In spite of the flood of books on statistics designed to serve
primarily those who work in the fields of education, economics, sociology, and
psychology, there still remains a consistent vagueness with reference
to the true significance of an average. In the opinion of the writer,
this situation is a direct result of the tendency to take over, in
somewhat uncritical fashion, the formulas and definitions (so-called) found
in current textbooks in mathematics.
If one were to follow current practice, he would say that there
are many kinds of averages, and proceed to enumerate them, even
going so far as to illustrate their differences by applying the various
formulas to the same set of numbers (say two and eight, for example).
Thus, the arithmetic mean of two and eight is five, the geometric mean
of two and eight is four, and the harmonic mean of two and eight is
3.2. But what is it all about? Not many students in advanced
statistics seem to be able to tell. This probably is not the fault of the
students.
The reason that students generally have no fair comprehension of
these averages is that the averages are not really defined, except by
implication. What is done is to tell how the different averages are
computed. We say that the arithmetic mean of two numbers is half
their sum, the geometric mean is the square root of their product, and
the harmonic mean is the reciprocal of the arithmetic mean of their
reciprocals. This may all be true, but what of it? What concepts
are given? What similarities are observed, and what distinctions
made?
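
The three computing rules just quoted can be written out directly. The sketch below is only an illustration of those textbook rules for two numbers; the function names are chosen here and do not come from the article.

```python
import math


def arithmetic_mean(a, b):
    """Half the sum of the two numbers."""
    return (a + b) / 2


def geometric_mean(a, b):
    """Square root of the product of the two numbers."""
    return math.sqrt(a * b)


def harmonic_mean(a, b):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return 1 / arithmetic_mean(1 / a, 1 / b)


# The article's running example: two and eight.
print(arithmetic_mean(2, 8), geometric_mean(2, 8), harmonic_mean(2, 8))
```

For two and eight these rules give five, four, and 3.2, the values cited in the text. The code shows how each figure is computed, which is exactly the point the article makes: it says nothing yet about what each figure means.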
Suppose, on the other hand, that one were to approach the problem
by indirection, as it were. One might even be so unscientific as to
develop the concept in connection with specific illustrations. For
example, let us define the average daily wage rate in a steel mill as a
uniform daily wage rate such that the total payroll remains unchanged.
For simplicity, suppose there are two workers, one of whom (unskilled)
receives two dollars a day, and the other of whom (highly skilled)
receives eight dollars a day. What is the average rate of pay?


Let x equal the number of dollars in the uniform rate.
Then

$$2x = 10,$$

and

$$x = 5.$$

Generalizing for the case of only two workers, whose rates of pay
are respectively a dollars and b dollars,

$$2x = a + b,$$

and

$$x = \frac{a + b}{2}.$$


Thus with a definition that means something to the man in the mill 
as well as to the student in statistics, we prove that, consistent with our 
first definition, the average of two numbers is half their sum. 

Again, let us define the average annual rate of growth in population 
of a booming town as a uniform rate such that the resulting population 
at the end of a definite period of years remains unchanged. For 
simplicity, suppose that during the first of two years the town octuples 
its population, and that during the second year it doubles its popula- 
tion. What is the average annual rate of growth of population? 

Let a equal the number of people at the beginning of the first boom 
year. Then 8a equals the number of people at the end of the first 
boom year; and 16a equals the number of people at the end of the 
second boom year.

Now, let x equal the uniform rate of growth. Then ax equals the
number of people at the end of the first boom year; and $ax^2$ equals the
number of people at the end of the second boom year.

$$\therefore\ ax^2 = 16a,$$

$$x^2 = 16,$$

and

$$x = 4.$$

Generalizing for the case of only two years, in which the respective
rates of growth are k and l,

$$ax^2 = akl,$$

$$x^2 = kl,$$

and

$$x = \sqrt{kl}.$$
Thus with a definition that means something to the taxpayer in the
town, as well as to the student of statistics, we prove that, consistent 
with our second definition, the average of two numbers is the square 
root of their product. 
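
The growth illustration can be checked numerically. In the sketch below the starting population of 1,000 is a hypothetical figure (the article leaves it as a), and the function name is illustrative. The point is only that the uniform factor must reproduce the same final population.

```python
import math


def uniform_growth_factor(factors):
    """Uniform yearly factor that yields the same final population."""
    return math.prod(factors) ** (1 / len(factors))


start = 1000              # hypothetical starting population (the article's a)
factors = [8, 2]          # the town octuples, then doubles

x = uniform_growth_factor(factors)
actual_final = start * math.prod(factors)    # population after the two real years
uniform_final = start * x ** len(factors)    # population under the uniform factor
naive_final = start * 5 ** 2                 # using the arithmetic mean, 5, instead

print(x, actual_final, uniform_final, naive_final)
```

The uniform factor of four brings the town to sixteen times its starting size, as the two actual years do, whereas the arithmetic mean of eight and two would put it at twenty-five times.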

Finally, if the time required by one pipe to fill a cistern differs from 
that required by another, let us define the average time required by the 
two pipes as a uniform time such that if both pipes flow for that length 
of time the total flow of water will remain unchanged. (Whether they 
flow simultaneously or seriatim matters not in the least.) For sim- 
plicity, suppose one pipe fills a cistern in two hours, and another fills 
a cistern of the same size in eight hours. What is the average time 
required to fill a cistern of that size? 

Let x equal the uniform number of hours required. Now one pipe
fills one-half of a cistern in an hour, and the other fills one-eighth of a
cistern in an hour. Therefore

$$\frac{x}{2} + \frac{x}{8} = 2,$$

$$4x + x = 16,$$

$$5x = 16,$$

and

$$x = 3.2.$$


(It should be remarked in passing that for the two pipes to fill one 
cistern therefore requires 1.6 hours. The truth of this is readily
verified. This is the classical problem in “mental arithmetics.”)

Generalizing for the case of only two pipes, which require
respectively a and b hours to fill the cistern, we have

$$\frac{x}{a} + \frac{x}{b} = 2,$$

$$bx + ax = 2ab,$$

and

$$x = \frac{2ab}{a + b}.$$

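Changing the form somewhat,

$$x\left(\frac{1}{a} + \frac{1}{b}\right) = 2,$$

and

$$x = \frac{1}{\frac{1}{2}\left(\frac{1}{a} + \frac{1}{b}\right)}.$$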
Thus with a definition which means something to the user of cistern
water, as well as to the student in statistics, we prove that, consistent 
with our third definition, the average of two numbers is the reciprocal 
of the arithmetic mean of the reciprocals of those numbers. 
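
The cistern illustration can be verified with exact fractions. The sketch below follows the article's definition (both pipes run for the same x hours and the total flow stays at two cisterns). The function name and the use of Python's Fraction type are choices made here for the illustration.

```python
from fractions import Fraction


def uniform_fill_time(hours_a, hours_b):
    """Hours x such that both pipes, each run x hours, deliver two cisterns."""
    rate_a = Fraction(1, hours_a)   # cisterns per hour for the first pipe
    rate_b = Fraction(1, hours_b)   # cisterns per hour for the second pipe
    total_flow = 2                  # each pipe fills one cistern in its own time
    return total_flow / (rate_a + rate_b)


x = uniform_fill_time(2, 8)                         # average time per cistern
together = 1 / (Fraction(1, 2) + Fraction(1, 8))    # both pipes on one cistern

print(x, float(x), together, float(together))
```

The average time comes out at 16/5, or 3.2 hours, which equals 2ab/(a + b) for a = 2 and b = 8. The two pipes working together fill a single cistern in 8/5, or 1.6 hours, as the parenthetical remark in the text states.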

By these simple illustrations, we are thus led to see that the three 
classical means, designated respectively as the arithmetic, the geo- 
metric, and the harmonic, are really all based on the same fundamental 
idea, that of a uniform rate. The character of the data determines the 
variations in form of computation. 


II. AVERAGES AS APPLIED TO RATES OF WORK


To some readers it may appear that in the first two illustrations 
we have obtained an average rate, whereas in the third illustration 
the average obtained is not a rate, but a period of time. Perhaps the 
following discussion will clarify that point. 

1. There is a functional relationship between time expended and 
work accomplished, which leads to the concept of rate. The simplest 
treatment is based on an assumption of constant rate. This assump- 
tion is usually tacit, but it seems best to state it explicitly in this 
discussion. 

2. Rate may be expressed equally well in terms of the number of 
units of work accomplished, per unit of time expended, or in terms of 
the number of units of time expended, per unit of work accomplished. 
These two methods must express one and the same rate; that is to say, 
they must be consistent, one with the other. 

3. The average rate defined in terms of the number of units of 
work accomplished, per unit of time expended, must be consistent with 
the average rate defined in terms of the number of units of time 
expended, per unit of work accomplished. This may well be called 
the principle of consistency of average rates. Its application is by no 
means confined to the work-time relationship. 
4. Historically, and perhaps logically (most certainly from a
practical point of view), this principle is maintained by defining the
average rate as a uniform rate such that the total amount of work 
accomplished remains unchanged. On this basis, the arithmetic mean 
results if the rate is expressed in terms of the number of units of work 
accomplished, per unit of time expended, but the harmonic mean 
results if the rate is expressed in terms of the number of units of time 
expended, per unit of work accomplished. 

Consistency could be obtained by exactly reversing the situation; 
that is, by defining the average in such a way that the total amount 
of time expended remains unchanged. In such case, the arithmetic 
and harmonic means would appear in exactly the reverse order. 
However, this is contrary to well established conventions of speech 
and thought. In no case can consistency be obtained by the use of one
mean to the exclusion of the other. 

5. This principle of consistency of average rates has a very prac- 
tical application in the experimental field. In some instances it is 
convenient to express rates in terms of the number of units of work 
accomplished, per unit of time expended, whereas in other cases it is 
more convenient to express rates in terms of the number of units of 
time expended, per unit of work accomplished. In the former case, 
the arithmetic mean should consistently be employed, but in the latter 
case the harmonic mean should consistently be employed. In no 
other way (except by reversing the two) can two sets of data thus 
differently recorded be put on a comparable basis. 
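
The principle of consistency set out in points 1 to 5 can be illustrated as follows. The two work rates are hypothetical figures chosen to match the earlier examples, and the helper names are illustrative. Exact fractions are used so that the reciprocal relationship can be tested without rounding.

```python
from fractions import Fraction


def arithmetic_mean(values):
    """Sum of the values divided by their number."""
    return sum(values) / len(values)


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)


# Hypothetical data: two workers, recorded as units of work per hour.
work_per_hour = [Fraction(2), Fraction(8)]
# The same performances recorded as hours per unit of work.
hours_per_unit = [1 / r for r in work_per_hour]

avg_rate = arithmetic_mean(work_per_hour)      # units per hour
avg_time = harmonic_mean(hours_per_unit)       # hours per unit
mismatched = arithmetic_mean(hours_per_unit)   # arithmetic mean used on both forms

print(avg_rate, avg_time, mismatched, 1 / mismatched)
```

Here the arithmetic mean of the work-for-time figures (5 units per hour) and the harmonic mean of the time-for-work figures (1/5 hour per unit) are reciprocals of one another, so the two records agree. Applying the arithmetic mean to both forms gives 5/16 hour per unit, which implies 3.2 units per hour and contradicts the first average.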

In practice, it would seem advisable in every case to convert 
time-for-work data into comparable work-for-time data. A table of 
reciprocals of the natural numbers offers a convenient means to that 
end. There have been empirical studies made which seem to indicate 
that distributions thus obtained are more nearly symmetrical, and 
that correlation coefficients are favorably influenced. Regardless of 
whether or not the resultant coefficients are less or more favorable to 
the desires of an investigator, it would appear certain that correlation 
coefficients obtained from data differently recorded must necessarily 
be subject to an unwarranted bias. It would further seem that the 
implications involved might, with great profit to everybody con- 
cerned, be subjected to critical analysis by someone in the field of 
pure mathematics. 
W. N. and L. A. KELLOGG. The Ape and the Child. New York:
McGraw-Hill, 1933. Pp. XIV + 341. 


“The Wild Boy of Aveyron” and other “animal children” are
cases of humans adapting themselves to the lives of animals; Dr. and 
Mrs. Kellogg reversed the process and adapted a chimpanzee to a 
normal human environment. A baby female chimpanzee, borrowed 
from the Abreu Colony in Cuba, was reared with their only son 
Donald as a member of their household. The chimpanzee, Gua, was
7½ months old when adopted into their household on June 26, 1931,
and remained a member of it until March 28, 1932, a period of nine
months. Donald was 2½ months older. As far as possible the
treatment accorded Gua was exactly the same as that which was given
to Donald. She ate the same food, was dressed in the same clothes,
slept in the same kind of bed, and was wheeled in the same
perambulator as Donald. Dr. Kellogg, however, made himself the “Mother”
of Gua, while Mrs. Kellogg mothered Donald. The tests and measure- 
ments of the two youngsters took up such a large part of each day that 
to this extent the environment was abnormal for both. 

The results of this magnificent experiment are difficult to interpret. 
Gua, being a chimpanzee, was stronger, grew and developed more 
rapidly, had a differently shaped hand and exhibited greater sensory 
acuities, especially in the fields of smell and hearing. Instead of com- 
paring Gua with Donald she should have been compared sometimes 
with a human subject of twice her age, sometimes with one of six or 
eight times her age. In addition, the natural dependence for security 
on a grown-up caused Gua to exhibit frightful emotional outbursts 
when deprived of Dr. Kellogg’s company and protection. It is very 
doubtful if Gua ever regarded Donald as another ape, although Donald, 
so far as could be observed, regarded Gua as another human being. 

As would almost be expected, Gua had a better memory for food 
and places than Donald, and, through her greater strength, learned to 
open doors by twisting the handle earlier than Donald. What is 
rather astonishing is that Gua made almost as great a score as Donald 
in Dr. Gesell’s tests, and was superior in learning control of the blad- 
der, in skipping, in drinking from a glass, and in eating with a spoon. 
The ape’s rate of learning was on the whole faster than the human’s. 

Where Gua fell down was in language usage, though not in language
comprehension. Dr. Kellogg vainly tried to teach her to say
“Papa.” Gua never played with her voice as Donald did, and this
failure to use language, in the last analysis, explains the bridge between
humans and chimpanzees. Otherwise, pound for pound weight, the
chimpanzee was easily the superior organism at the age she was studied.
Too much credit cannot be given Dr. and Mrs. Kellogg for their 
bold and successful experiment, and for the calm, dispassionate way 
they have recorded the results. If they got as much entertainment 
and instruction in carrying out the experiment as their readers have in 
reading about it, they will feel rewarded for a very strenuous nine 
months’ work. P. SANDIFORD. 
University of Toronto. 


MARGARET MEAD. The Changing Culture of an Indian Tribe. New
York: Columbia University Press, 1932. Pp. 313. 


Dr. Mead has acquitted her task nobly in this monograph. Built 
upon direct personal experience of some summers ago among a tribe of 
Plains Indians, this book stands as an exceptionally intelligent and 
well-written account, which is more than can be said for most of the 
field-studies, which are usually either so dull as to be harrowing or so 
sensationally superficial as to be suspicious as to verity. The impact 
of highly cultured groups upon more primitive societies is a process 
which sums up much of the outstanding importance of recent world 
history, and yet in spite of its constant and worldwide occurrence, few 
intimate and trustworthy accounts exist of the processes of culture- 
contact. Dr. Mead’s work fills that long-felt need. 

Dr. Mead states the importance of this study of human societies in
disequilibrium in these words: “Difficult to control, difficult to
duplicate in the experience of the student, too aberrant to make plausible a
prediction of its exact occurrence, too disorganized and complicated to
provide a satisfactory study, it nevertheless should serve to illuminate
the social process, to give the type of understanding which springs
from the very characteristic which makes it in other respects so unsatis- 
factory—distortion.”’ 

The economic life, the political situation, social organization, 
religious attitudes and the educational situation are all described 
with a fund of first-hand information indisputably valid and pertinent 
to the problem in hand. Dr. Mead made a special effort to gain some 
conception of the rôle played by the Indian women in the process of
cultural disorganization. Their efforts to control the household
economy in the midst of perturbations of all sorts such as poverty,
influx of foreign instruments and contacts present the spear-head of
the invasion of the superior culture. Within the disintegrating social
structure, the individual “is left floundering in a heterogeneous welter
of meaningless, uncoördinated and disintegrating institutions.”

There is a series of tabular and diagrammatic appendices which 
furnish the raw material as it was gathered for this account. An 
excellent index completes an otherwise excellent work throughout. 


NATHAN MILLER.
Carnegie Institute of Technology. 


J. M. REINHARDT and G. R. DAVIES. Principles and Methods of
Sociology. New York: Prentice-Hall, 1932. Pp. 685. 


This book typifies as a textbook the sprawling nature and protean
aspects of the study known as “sociology.” Deriving its most basic
postulates from other fields such as economics and its technique from widely
varying disciplines, it seeks to build up a unique and embracing study
of society, but with very few exceptions as yet, the attempt does not
quite come off. To say that in this text sociology does not succeed in
emerging as a closely-knit and autonomous subject is not to condemn
the work utterly, because it does represent an able summation and
outline of the material usually offered to students of this “subject.”
There are closely-reasoned and cogent but brief summaries of views
upon population problems, race, the family and the other nuclei in this
broad canvas. Least satisfactory, perhaps, are the chapters devoted to
a discussion of the “social process” or the psychological aspects of
society. Ambiguous generalizations and fuzzy notions have always
characterized the field and most of them have been taken over bodily.
It is due to the inadequacy of this tendency in American sociology
that the discipline has acquired in the minds of so many co-workers in
social science a deep reputation for amateurishness and naivete,
which is justly deserved. There is a constant emphasis on statistical
techniques as illustrating and implementing the “principles” of the
text. Statistics in the social sciences evidence the desire for a
“quantitative” approach to vexing problems, but there is still deep scepticism
prevailing as to the relevance of the statistical approach to sociology.
However, the statistical summary at the end of the volume should be
of enormous utility to beginners in the subject. The institutional
aspects, in which, in our mind, the real substance of “sociology” is to
be found, are hastily and somewhat inadequately surveyed, and it is
hard to see how a beginning student can help but be confused by the
material assembled in such a hasty way. The chapter on the family,
for instance, attempts a gigantic task and is hardly justified as a single
chapter. In short, in its faults, this text epitomizes the worst and the
best of most introductory courses in sociology, and until some
consensus is arrived at, this subject will still remain as the weakest sister
in the field of the so-called social sciences. NATHAN MILLER.
Carnegie Institute of Technology. 


EDWARD SAFFORD JONES. Comprehensive Examinations in American 
Colleges. New York: The Macmillan Co., 1933. Pp. 436. 


The Association of American Colleges, through a grant from the 
General Education Board, instituted in 1931 an investigation of the 
use made of comprehensive examinations in American colleges. Dr. 
Edward S. Jones of the University of Buffalo was appointed as the
director of the study. The present volume is the official report of the 
investigation. 

The report is both descriptive and quantitative. In the descriptive
portion of the report such problems as the following are discussed:
types of examination questions, origin of comprehensive examinations,
relation to honors courses, administrative difficulties, and
improvement of examinations. Quantitative studies based on the
results of personal interviews, check lists, and written comments
include the following: attitudes of teachers, students, and alumni
toward comprehensive examinations; current practices in colleges; and
reputed values of both comprehensive examinations and honors
courses.

From evidence presented, the use of comprehensive examinations 
appears to have been stimulated by the introduction of honors courses 
into American colleges. From 1900 to 1920 less than ten institutions 
from among the six hundred fifty-four circularized are reported to 
have used comprehensive examinations. In 1933 the number using 
such examinations is reported as eighty-five and some colleges included 
all or part of the student body in addition to honors students in the 
examination program. The comprehensive examination is rapidly 
becoming a part of the scheme of college education. 

Those faculties which have used comprehensive examinations have 
encountered numerous problems relating to examination technique. 
The impression is gained that few college teachers are satisfied with
either the essay type or the oral type examination question. Nor 
is marked enthusiasm shown for the new type objective short answer 
question. Obviously one of the obstacles retarding the extension of 
the comprehensive examination movement is this lack of agreement on 
what constitutes good examination technique. This difficulty is 
discussed in the report and specific suggestions for improvement of 
examination procedures are made. 

The descriptive sections of the report appear to be biased by the 
author’s point of view which is stated in the following manner: ‘‘A 
final comprehensive examination should never be thought of in and 
by itself. It is part of a program of education, often radically different 
from the ordinary.” This approach causes the presentation to become 
considerably involved at times. Despite this limitation the report 
contains much timely material of interest to the critical reader but 
persons looking for a guide-book on the preparation of comprehensive 
examinations will be disappointed. GLEN U. CLEETON. 

Carnegie Institute of Technology. 


FREDERICK H. LUND. Psychology: An Empirical Study of Behavior.
New York: Ronald Press, 1933. Pp. XV + 475. 


In the preface to this work Dr. Lund states “As may be seen, the
material and ideas presented in the text have been drawn from many 
sources—too many, in fact, to make appropriate acknowledgments 
possible. Wherever quotations or illustrations have been borrowed, 
however, credit is given on the pages on which they occur.” 

The first part of this statement is true and those who are acquainted 
with the works of Woodworth, Dashiell, Hollingworth, Starch and 
Sandiford can trace the borrowed sections quite easily. If the author 
had added organization to the phrase “material and ideas” the
statement would have been even more true. The second part of the
statement is open to question. Certain illustrations have apparently been
ingeniously manipulated and photographed without acknowledgment 
of their origin. But the work has value. If one is going to borrow at 
all, it is wise to borrow from the works of successful authors. 


PETER SANDIFORD. 
University of Toronto. 